It would seem that when there is a wildcard in the last position of the file path and the HAR file protocol is used, the number of combined input paths is 0. I get this output when running the example below:

2012-09-27 09:22:28,074 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: LIMIT
2012-09-27 09:22:28,074 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - pig.usenewlogicalplan is set to true. New logical plan will be used.
2012-09-27 09:22:28,147 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - (Name: x: Store(hdfs://nn/tmp/temp1300843291/tmp-1282091819:org.apache.pig.impl.io.InterStorage) - scope-2 Operator Key: scope-2)
2012-09-27 09:22:28,155 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2012-09-27 09:22:28,189 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
2012-09-27 09:22:28,189 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
2012-09-27 09:22:28,268 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
2012-09-27 09:22:28,280 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2012-09-27 09:22:30,055 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
2012-09-27 09:22:30,096 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
2012-09-27 09:22:30,597 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2012-09-27 09:22:46,428 [Thread-6] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : *21667*
2012-09-27 09:22:46,431 [Thread-6] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : *21667*
2012-09-27 09:22:46,440 [Thread-6] INFO com.hadoop.compression.lzo.GPLNativeCodeLoader - Loaded native gpl library
2012-09-27 09:22:46,443 [Thread-6] INFO com.hadoop.compression.lzo.LzoCodec - Successfully loaded & initialized native-lzo library [hadoop-lzo rev 335fea4fecb385745e9a6f2de174a5b26fbc6cae]
2012-09-27 09:24:04,257 [Thread-6] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - *Total input paths (combined) to process : 0*
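For context, the script that produces the log above is essentially of this shape (the path here is a placeholder standing in for our actual cluster layout, not the real one):

    -- load everything under the archive via a trailing wildcard in a har:// path
    -- (placeholder path; the real input has the wildcard in the last position)
    a = LOAD 'har:///x/y/z/a.har/*';
    -- LIMIT matches the "Pig features used in the script: LIMIT" line in the log
    b = LIMIT a 10;
    DUMP b;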
It seems that MapRedUtil returns 0 paths to process after it tries to find the proper splits. Any suggestions?

On Wed, Sep 26, 2012 at 9:29 PM, Mohnish Kodnani <[email protected]> wrote:

> Hi,
> I had emailed the user mailing list regarding my problem but did not get
> much input, hence I am emailing the developer community.
> I have 2 questions about Pig and how it uses the HAR FileSystem.
>
> 1. It seems path globbing does not work with HAR files in Pig; is this
> intentional? For example:
> hadoop fs -ls har:///x/y/z/{a.har,b.har}/* works and lists all the files
> in both har files. If I give the same path as the input path to a pig
> script, it does not seem to work.
>
> 2. Wildcards in the HAR path.
> Like the above example, if I do the following with hadoop fs, it works:
> hadoop fs -ls har://x/y/*/a.har/*
> This lists all files from all folders that have a.har.
>
> If I give the same path as the input path to Pig, it does not work. I have
> tried these 2 things on Pig 0.8.
> Also, for the second use case, if I remove the last wildcard where the
> files should be, then it works.
> For example, input path to Pig:
> har://x/y/*/a.har/logFile
>
> Then Pig can read the file and give me records back, but a wildcard in the
> last position does not work.
>
> Any insights would be great on whether this should or should not work. I
> have 30,000 files in one folder inside the har; I cannot list each one, and
> I want to use a wildcard as the last element in the path and use path
> globbing to provide multiple har files.
> I need to understand if this is in the Pig or the Hadoop MR code base.
>
>
> Thanks
> Mohnish
>
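To make the two quoted scenarios concrete, the corresponding LOAD statements look roughly like this (placeholder paths again, mirroring the hadoop fs commands quoted above):

    -- 1. path globbing across multiple HAR files: listed fine by hadoop fs -ls,
    --    but returns nothing when used as Pig input
    g = LOAD 'har:///x/y/z/{a.har,b.har}/*';

    -- 2. wildcard mid-path plus a trailing wildcard: also returns nothing as Pig input
    w = LOAD 'har://x/y/*/a.har/*';

    -- 3. same as 2 but with an explicit file name instead of the trailing wildcard: works
    f = LOAD 'har://x/y/*/a.har/logFile';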
