Thank you Rohan, I really appreciate your help! I'll give it a shot and post back if it works.
~Ed

On Mon, Sep 27, 2010 at 11:51 PM, Rohan Rai <rohan....@inmobi.com> wrote:

> Just corrected/tested and pushed LzoTokenizedLoader to the personal fork
>
> Hopefully it works now
>
> Regards
> Rohan
>
> Dmitriy Ryaboy wrote:
>
>> lzop should work.
>>
>> On Mon, Sep 27, 2010 at 10:59 AM, Rohan Rai <rohan....@inmobi.com> wrote:
>>
>>> Well
>>>
>>> I haven't tried (rather I don't remember) compressing via lzop and then
>>> putting on cluster...
>>> So cant tell you about that...Here is what works for me.
>>>
>>> I do it by first putting the file on cluster and then doing Stream
>>> Compression.
>>>
>>> And yes it need not be indexed (I guess it doesn't matter for small
>>> test file, otherwise it is unwise
>>> for one loses the benefit of parallelism)
>>>
>>> Regards
>>> Rohan
>>>
>>> pig wrote:
>>>
>>>> Hi Rohan,
>>>>
>>>> The test file (test_input_chars.txt.lzo) is not indexed. I created it
>>>> using the command
>>>>
>>>> 'lzop test_input_chars.txt'
>>>>
>>>> It's a really small file (only 6 lines) so I didn't think it needed to
>>>> be index. Do all files regardless of size need to be indexed for the
>>>> LzoTokenizedLoader to work?
>>>>
>>>> Thank you!
>>>>
>>>> ~Ed
>>>>
>>>> On Mon, Sep 27, 2010 at 1:25 AM, Rohan Rai <rohan....@inmobi.com> wrote:
>>>>
>>>>> Oh Sorry I am completely out of sync...
>>>>>
>>>>> Can you tell how did you lzo'ed and indexed the file
>>>>>
>>>>> Regards
>>>>> Rohan
>>>>>
>>>>> Rohan Rai wrote:
>>>>>
>>>>>> Oh Sorry I did not see this mail ...
>>>>>>
>>>>>> Its not an official patch/release
>>>>>>
>>>>>> But here is a fork on elephant-bird which works with pig 0.7
>>>>>>
>>>>>> for normal LZOText Loading etc
>>>>>>
>>>>>> (NOt HbaseLoader)
>>>>>>
>>>>>> Regards
>>>>>> Rohan
>>>>>>
>>>>>> Dmitriy Ryaboy wrote:
>>>>>>
>>>>>>> The 0.7 branch is not tested.. it's quite likely it doesn't actually
>>>>>>> work :).
>>>>>>>
>>>>>>> Rohan Rai was working on it.. Rohan, think you can take a look and
>>>>>>> help Ed out?
>>>>>>>
>>>>>>> Ed, you may want to check if the same input works when you use Pig 0.6
>>>>>>> (and the official elephant-bird, on Kevin Weil's github).
>>>>>>>
>>>>>>> -D
>>>>>>>
>>>>>>> On Thu, Sep 23, 2010 at 6:49 AM, pig <hadoopn...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> After getting all the errors to go away with LZO libraries not being
>>>>>>>> found and missing jar files for elephant-bird I've run into a new
>>>>>>>> problem when using the elephant-bird branch for pig 0.7
>>>>>>>>
>>>>>>>> The following simple pig script works as expected
>>>>>>>>
>>>>>>>> REGISTER elephant-bird-1.0.jar
>>>>>>>> REGISTER /usr/lib/elephant-bird/lib/google-collect-1.0.jar
>>>>>>>> A = load '/usr/foo/input/test_input_chars.txt';
>>>>>>>> DUMP A;
>>>>>>>>
>>>>>>>> This just dumps out the contents of the test_input_chars.txt file
>>>>>>>> which is tab delimited. The output looks like:
>>>>>>>>
>>>>>>>> (1,a,a,a,a,a,a)
>>>>>>>> (2,b,b,b,b,b,b)
>>>>>>>> (3,c,c,c,c,c,c)
>>>>>>>> (4,d,d,d,d,d,d)
>>>>>>>> (5,e,e,e,e,e,e)
>>>>>>>>
>>>>>>>> I then lzop the test file to get test_input_chars.txt.lzo (I
>>>>>>>> decompressed this with lzop -d to make sure the compression worked
>>>>>>>> fine and everything looks good).
>>>>>>>> If I run the exact same script provided above on the lzo file it
>>>>>>>> works fine. However, this file is really small and doesn't need to
>>>>>>>> use indexes. As a result, I wanted to have LZO support that worked
>>>>>>>> with indexes. Based on this I decided to try out the elephant-bird
>>>>>>>> branch for pig 0.7 located here
>>>>>>>> (http://github.com/hirohanin/elephant-bird/) as recommended by
>>>>>>>> Dimitriy.
>>>>>>>>
>>>>>>>> I created the following pig script that mirrors the above script but
>>>>>>>> should hopefully work on LZO files (including indexed ones)
>>>>>>>>
>>>>>>>> REGISTER elephant-bird-1.0.jar
>>>>>>>> REGISTER /usr/lib/elephant-bird/lib/google-collect-1.0.jar
>>>>>>>> A = load '/usr/foo/input/test_input_chars.txt.lzo' USING com.twitter.elephantbird.pig.load.LzoTokenizedLoader('\t');
>>>>>>>> DUMP A;
>>>>>>>>
>>>>>>>> When I run this script which uses the LzoTokenizedLoader there is no
>>>>>>>> output. The script appears to run without errors but there are zero
>>>>>>>> Records Written and 0 Bytes Written.
>>>>>>>>
>>>>>>>> Here is the exact output:
>>>>>>>>
>>>>>>>> grunt > DUMP A;
>>>>>>>> [main] INFO com.twitter.elephantbird.pig.load.LzoTokenizedLoader - LzoTokenizedLoader with given delimited [ ]
>>>>>>>> [main] INFO com.twitter.elephantbird.pig.load.LzoTokenizedLoader - LzoTokenizedLoader with given delimited [ ]
>>>>>>>> [main] INFO com.twitter.elephantbird.pig.load.LzoTokenizedLoader - LzoTokenizedLoader with given delimited [ ]
>>>>>>>> [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - (Name: Store(hdfs://master:9000/tmp/temp-2052828736/tmp-1533645117:org.apache.pig.builtin.BinStorage) - 1-4 Operator Key: 1-4
>>>>>>>> [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
>>>>>>>> [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
>>>>>>>> [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
>>>>>>>> [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
>>>>>>>> [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
>>>>>>>> [Thread-12] WARN org.apache.hadoop.mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
>>>>>>>> [Thread-12] INFO com.twitter.elephantbird.pig.load.LzoTokenizedLoader - LzoTokenizedLoader with given delimiter [ ]
>>>>>>>> [Thread-12] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
>>>>>>>> [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
>>>>>>>> [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_201009101108_0151
>>>>>>>> [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - More information at http://master:50030/jobdetails.jsp?jobid=job_201009101108_0151
>>>>>>>> [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete
>>>>>>>> [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
>>>>>>>> [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Succesfully stored result in "hdfs://amb-hadoop-01:9000/tmp/temp-2052828736/tmp-1533645117
>>>>>>>> [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Records written: 0
>>>>>>>> [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Bytes written: 0
>>>>>>>> [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Spillable Memory Manager spill count : 0
>>>>>>>> [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Proactive spill count : 0
>>>>>>>> [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
>>>>>>>> [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process: 1
>>>>>>>> [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process: 1
>>>>>>>> grunt >
>>>>>>>>
>>>>>>>> I'm not sure if I'm doing something wrong in my use of
>>>>>>>> LzoTokenizedLoader or if there is a problem with the class itself
>>>>>>>> (most likely the problem is with my code heh) Thank you for any help!
>>>>>>>>
>>>>>>>> ~Ed
>>>>>>>>
>>>>>>>> .
>>>>>>
>>>>>> The information contained in this communication is intended solely for
>>>>>> the use of the individual or entity to whom it is addressed and others
>>>>>> authorized to receive it.
>>>>>> It may contain confidential or legally privileged
>>>>>> information. If you are not the intended recipient you are hereby
>>>>>> notified that any disclosure, copying, distribution or taking any
>>>>>> action in reliance on the contents of this information is strictly
>>>>>> prohibited and may be unlawful. If you have received this
>>>>>> communication in error, please notify us immediately by responding to
>>>>>> this email and then delete it from your system.
>>>>>> The firm is neither liable for the proper and complete transmission of
>>>>>> the information contained in this communication nor for any delay in
>>>>>> its receipt.