It would seem that when there is a wildcard in the last position of the file path and the HAR file protocol is used, the number of combined input paths is 0. I get this output when running the example below:

2012-09-27 09:22:28,074 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: LIMIT
2012-09-27 09:22:28,074 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - pig.usenewlogicalplan is set to true. New logical plan will be used.
2012-09-27 09:22:28,147 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - (Name: x: Store(hdfs://nn/tmp/temp1300843291/tmp-1282091819:org.apache.pig.impl.io.InterStorage) - scope-2 Operator Key: scope-2)
2012-09-27 09:22:28,155 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2012-09-27 09:22:28,189 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
2012-09-27 09:22:28,189 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
2012-09-27 09:22:28,268 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
2012-09-27 09:22:28,280 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2012-09-27 09:22:30,055 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
2012-09-27 09:22:30,096 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
2012-09-27 09:22:30,597 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2012-09-27 09:22:46,428 [Thread-6] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : *21667*
2012-09-27 09:22:46,431 [Thread-6] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : *21667*
2012-09-27 09:22:46,440 [Thread-6] INFO com.hadoop.compression.lzo.GPLNativeCodeLoader - Loaded native gpl library
2012-09-27 09:22:46,443 [Thread-6] INFO com.hadoop.compression.lzo.LzoCodec - Successfully loaded & initialized native-lzo library [hadoop-lzo rev 335fea4fecb385745e9a6f2de174a5b26fbc6cae]
2012-09-27 09:24:04,257 [Thread-6] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - *Total input paths (combined) to process : 0*
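For context, the script that produces the log above is essentially of this shape (the path here is a placeholder standing in for our actual cluster layout, not the real one):

    -- load everything under the archive via a trailing wildcard in a har:// path
    -- (placeholder path; the real input has the wildcard in the last position)
    a = LOAD 'har:///x/y/z/a.har/*';
    -- LIMIT matches the "Pig features used in the script: LIMIT" line in the log
    b = LIMIT a 10;
    DUMP b;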
It seems that MapRedUtil returns 0 paths to process after it tries to find the proper splits. Any suggestions?

On Wed, Sep 26, 2012 at 9:29 PM, Mohnish Kodnani <[email protected]> wrote:

> Hi,
> I had emailed the user mailing list regarding my problem but did not get
> much input, hence I am emailing the developer community.
> I have 2 questions about Pig and how it uses the HAR FileSystem.
>
> 1. It seems path globbing does not work with HAR files in Pig; is this
> intentional? For example:
> hadoop fs -ls har:///x/y/z/{a.har,b.har}/* works and lists all the files
> in both har files. If I give the same path as the input path to a pig
> script, it does not seem to work.
>
> 2. Wildcards in the HAR path.
> Like the above example, if I do the following with hadoop fs, it works:
> hadoop fs -ls har://x/y/*/a.har/*
> This lists all files from all folders that have a.har.
>
> If I give the same path as the input path to Pig, it does not work. I have
> tried these 2 things on Pig 0.8.
> Also, for the second use case, if I remove the last wildcard where the
> files should be, then it works.
> For example, input path to Pig:
> har://x/y/*/a.har/logFile
>
> Then Pig can read the file and give me records back, but a wildcard in the
> last position does not work.
>
> Any insights would be great on whether this should or should not work. I
> have 30,000 files in one folder inside the har; I cannot list each one, and
> I want to use a wildcard as the last element in the path and use path
> globbing to provide multiple har files.
> I need to understand if this is in the Pig or the Hadoop MR code base.
>
>
> Thanks
> Mohnish
>
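To make the two quoted scenarios concrete, the corresponding LOAD statements look roughly like this (placeholder paths again, mirroring the hadoop fs commands quoted above):

    -- 1. path globbing across multiple HAR files: listed fine by hadoop fs -ls,
    --    but returns nothing when used as Pig input
    g = LOAD 'har:///x/y/z/{a.har,b.har}/*';

    -- 2. wildcard mid-path plus a trailing wildcard: also returns nothing as Pig input
    w = LOAD 'har://x/y/*/a.har/*';

    -- 3. same as 2 but with an explicit file name instead of the trailing wildcard: works
    f = LOAD 'har://x/y/*/a.har/logFile';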
