Actually, the intermediate data won't be stored in memory. It will be stored in a temporary directory on HDFS, and Pig will clean up the intermediate data for you when the job is finished.
Yes, BinStorage is a binary format for storing intermediate data, and it knows how to deserialize that data back into tuples.

On Fri, Aug 20, 2010 at 3:35 PM, Defenestrator <[email protected]> wrote:
> Right, in cases where you have to load multiple large relations and then do
> some processing on each relation (filtering, aggregation) before joining
> them together. One wouldn't want to have all of the relations and
> intermediate state in memory before the join.
>
> So is BinStorage just storing the Tuples in an internal binary format that
> is easily converted back to a Tuple when loaded (i.e. no csv parsing
> necessary)?
>
> Thanks.
>
> On Fri, Aug 20, 2010 at 12:06 AM, Jeff Zhang <[email protected]> wrote:
>
>> What do you mean by "multiple relations with many tuples"? Do you mean
>> joining multiple data sets?
>> And Pig uses BinStorage for storing intermediate data.
>>
>> On Fri, Aug 20, 2010 at 2:42 PM, Defenestrator
>> <[email protected]> wrote:
>> > Thanks, Jeff.
>> >
>> > A quick follow-up question relating to the loading/storing of data - what
>> > is the best practice when dealing with multiple relations with many
>> > tuples? Do people typically STORE intermediate relations to minimize
>> > memory usage and RELOAD the intermediate data for use later on in the
>> > same script? I ask because I noticed that tuples are written out using
>> > the TupleFormat, which outputs text with additional parentheses, so a
>> > subsequent PigStorage LOAD would get extra parenthesis characters, right?
>> >
>> > On Thu, Aug 19, 2010 at 1:50 AM, Jeff Zhang <[email protected]> wrote:
>> >
>> >> I am afraid you should write your own LoadFunc to interpret the text.
>> >> As of Pig 0.7, local mode uses Hadoop's standalone local mode, so it
>> >> won't store all the data in memory. The data will be read in stream
>> >> mode, but this mode needs more memory because each task is executed
>> >> in another JVM.
>> >>
>> >> On Thu, Aug 19, 2010 at 12:48 AM, Defenestrator
>> >> <[email protected]> wrote:
>> >> > What loader should I use on csv files with quoted strings that contain
>> >> > embedded commas? (i.e. Embedded commas should not be a separator.)
>> >> >
>> >> > And when LOADing large files in local mode, does Pig just store it all
>> >> > in memory? Or does it have memory management ala buffer managers in
>> >> > DBMS's?

--
Best Regards

Jeff Zhang
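
For anyone who still wants to STORE and reload an intermediate relation explicitly, as asked in the follow-up question above, here is a minimal sketch using BinStorage on both sides so that no text parsing (and no extra parentheses) is involved. The paths, relation names, and schemas are hypothetical and only for illustration. As noted in the reply, Pig already writes its own intermediate data to a temporary directory, so an explicit round trip like this is normally only needed when the data has to outlive the script.

```pig
-- Hypothetical input paths and schemas, for illustration only.
users  = LOAD 'input/users'  USING PigStorage(',') AS (id:int, name:chararray);
orders = LOAD 'input/orders' USING PigStorage(',') AS (user_id:int, amount:double);

-- Some per-relation processing before the join.
big_orders = FILTER orders BY amount > 100.0;

-- Write the intermediate relation in Pig's binary format.
STORE big_orders INTO 'tmp/big_orders' USING BinStorage();

-- Reload it later; BinStorage deserializes straight back into tuples.
reloaded = LOAD 'tmp/big_orders' USING BinStorage() AS (user_id:int, amount:double);

joined = JOIN users BY id, reloaded BY user_id;
STORE joined INTO 'output/joined' USING PigStorage(',');
```

The AS clause on the reload is there so the join key has a name; the types are stored in the binary data itself.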
