Re: Loading CSV Files & LOAD large files behavior in local mode

Jeff Zhang Fri, 20 Aug 2010 00:40:57 -0700

Actually, the intermediate data won't be stored in memory.  It will be
stored in a tmp directory on HDFS, and Pig will clean up the
intermediate data for you when the job is finished.


Yes, BinStorage is a binary format for storing intermediate data, and it
knows how to deserialize that data back into tuples.
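
On the CSV question further down the thread: the piece a custom loader
has to get right is splitting each record on commas while leaving commas
inside quoted strings alone. Below is a minimal sketch of just that
parsing step, written in Python for illustration. It is not a Pig
LoadFunc; the function name iter_tuples and the sample data are made up
for this example. It reads records lazily, one at a time, rather than
holding the whole input in memory.

```python
import csv
import io
from typing import Iterable, Iterator, Tuple


def iter_tuples(lines: Iterable[str]) -> Iterator[Tuple[str, ...]]:
    """Yield one tuple of fields per CSV record, reading the input lazily.

    Hypothetical helper for illustration only; not part of Pig.
    """
    # csv.reader honors double-quoted fields, so a comma inside quotes
    # stays part of the field instead of acting as a separator.
    for row in csv.reader(lines, delimiter=",", quotechar='"'):
        yield tuple(row)


# Made-up sample data: the second field contains an embedded comma.
sample = io.StringIO('1,"Smith, John",NY\n2,"Doe, Jane",CA\n')
rows = list(iter_tuples(sample))
```

Here each record comes back with three fields, and the quoted name stays
in one piece. A real loader would apply the same rule per record and
hand the resulting fields to Pig as a tuple.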

On Fri, Aug 20, 2010 at 3:35 PM, Defenestrator
<[email protected]> wrote:
> Right, in cases where you have to load multiple large relations and then do
> some processing on each relation (filtering, aggregation) before joining
> them together.  One wouldn't want to have all of the relations and
> intermediate state in memory before the join.
>
> So is BinStorage just storing the Tuples in an internal binary format that
> is easily converted back to a Tuple when loaded (i.e. no csv parsing
> necessary)?
>
> Thanks.
>
> On Fri, Aug 20, 2010 at 12:06 AM, Jeff Zhang <[email protected]> wrote:
>
>> What do you mean by "multiple relations with many tuples"? Do you mean
>> joining multiple data sets?
>> And Pig uses BinStorage for storing intermediate data.
>>
>>
>> On Fri, Aug 20, 2010 at 2:42 PM, Defenestrator
>> <[email protected]> wrote:
>> > Thanks, Jeff.
>> >
>> > A quick follow-up question relating to the loading/storing of data:
>> > what is the best practice when dealing with multiple relations with
>> > many tuples? Do people typically STORE intermediate relations to
>> > minimize memory usage and RELOAD the intermediate data for use later
>> > on in the same script? I ask because I noticed that tuples written
>> > out using the TupleFormat come out as text with additional
>> > parentheses, which would cause a subsequent PigStorage LOAD to pick
>> > up extra parenthesis characters, right?
>> >
>> > On Thu, Aug 19, 2010 at 1:50 AM, Jeff Zhang <[email protected]> wrote:
>> >
>> >> I am afraid you should write your own LoadFunc to interpret the text.
>> >> As of Pig 0.7, local mode uses Hadoop's standalone local mode,
>> >> so it won't store all the data in memory; the data will be read
>> >> in stream mode. But this mode needs more memory because each task is
>> >> executed in another JVM.
>> >>
>> >>
>> >> On Thu, Aug 19, 2010 at 12:48 AM, Defenestrator
>> >> <[email protected]> wrote:
>> >> > What loader should I use on csv files with quoted strings that contain
>> >> > embedded commas?  (i.e. Embedded commas should not be a separator.)
>> >> >
>> >> > And when LOADing large files in local mode, does Pig just store it all
>> >> > in memory?  Or does it have memory management ala buffer managers in
>> >> > DBMS's?
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> Best Regards
>> >>
>> >> Jeff Zhang
>> >>
>> >
>>
>>
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>>
>



-- 
Best Regards

Jeff Zhang
