[ 
https://issues.apache.org/jira/browse/PIG-3204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13583602#comment-13583602
 ] 

Rohini Palaniswamy commented on PIG-3204:
-----------------------------------------

cat log4j.conf

{code}
# ***** Set root logger level to DEBUG and its only appender to A.
log4j.rootLogger=debug, A

# ***** A is set to be a ConsoleAppender.
log4j.appender.A=org.apache.log4j.ConsoleAppender
# ***** A uses PatternLayout.
log4j.appender.A.layout=org.apache.log4j.PatternLayout
log4j.appender.A.layout.ConversionPattern=%d [%t] %-5p %c %x - %m%n
{code}

cat simpleload.pig

{code}
A = LOAD '/tmp/data';
STORE A into '/tmp/out';
{code}

pig -log4jconf ~/pig/log4j.conf simpleload.pig

Doing
{code}
sed -n '/Pig features used in the script/,/getDelegationToken/p' /tmp/debug.log | grep getFileInfo | wc -l
{code}

gives 20 getFileInfo calls if /tmp/data is a directory and 35 calls if 
/tmp/data is a file. 

grep org.apache.pig.builtin.JsonMetadata /tmp/debug.log shows 10 log statements like
2013-02-21 22:04:41,096 [main] DEBUG org.apache.pig.builtin.JsonMetadata  - Could not find schema file for /tmp/data
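
As a rough sketch, the two counts above could be wrapped in one small shell function so they are easy to re-run after a change. The function name count_fs_calls and its output format are made up for illustration, and it assumes the console output of the pig run was captured to the log file passed as the argument.

```shell
# Hypothetical helper (not part of Pig): summarize FS-related lines in a
# captured Pig debug log.
# Usage: count_fs_calls /tmp/debug.log
count_fs_calls() {
    log="$1"

    # getFileInfo calls between the "Pig features used in the script" line
    # and the next getDelegationToken line. grep -c prints 0 but exits
    # non-zero when nothing matches, hence the "|| true".
    calls=$(sed -n '/Pig features used in the script/,/getDelegationToken/p' "$log" \
        | grep -c getFileInfo || true)

    # Failed schema-file lookups logged by JsonMetadata.
    misses=$(grep 'org.apache.pig.builtin.JsonMetadata' "$log" \
        | grep -c 'Could not find schema file' || true)

    echo "getFileInfo=$calls schema_misses=$misses"
}
```

Running it against logs from a directory input and a file input side by side should make the difference in call counts easy to compare.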


I haven't stepped through the code, but based on the logs this seems to be a
good candidate for optimization to cut down on the number of FS calls.


                
> Optimize the number of FS calls to get schema to cut down time before job 
> launch
> --------------------------------------------------------------------------------
>
>                 Key: PIG-3204
>                 URL: https://issues.apache.org/jira/browse/PIG-3204
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Rohini Palaniswamy
>
>   Currently a lot of NN calls are made to determine whether there is a
> schema file for a path in a LOAD statement. When the NN is slow (caused by
> a whole bunch of other issues), this takes a lot of time, and we found
> scripts spending anywhere from 5 to 40 minutes on it, depending upon the
> script. It seems to be a good place for optimization.
