[ https://issues.apache.org/jira/browse/PIG-3642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13876043#comment-13876043 ]
Cheolsoo Park commented on PIG-3642:
------------------------------------

Actually, discard my e2e results. I found some environment issues and am rerunning them.

> Direct HDFS access for small jobs (fetch)
> ------------------------------------------
>
>                 Key: PIG-3642
>                 URL: https://issues.apache.org/jira/browse/PIG-3642
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Lorand Bendig
>            Assignee: Lorand Bendig
>             Fix For: 0.13.0
>
>         Attachments: PIG-3642-3.patch, PIG-3642-4.patch, PIG-3642.patch
>
>
> With this patch I'd like to add the possibility to directly read data from
> HDFS instead of launching MR jobs in case of simple (map-only) tasks. Hive
> already has this feature (fetch). This patch shares some similarities with
> the local mode of Pig 0.6. Here, fetching kicks off when the following holds
> for a script:
> * it contains only LIMIT, FILTER, UNION (if no split is generated), STREAM,
> (nested) FOREACH with expression operators, custom UDFs, etc.
> * no scalar aliases
> * no SampleLoader
> * single leaf job
> * DUMP (no STORE)
> The feature is enabled by default and can be toggled with:
> * -N or -no_fetch
> * set opt.fetch true/false;
> There's no STORE support because I wanted to make it explicit that this
> "optimization" is for launching small/simple scripts during development,
> rather than querying and filtering a large number of rows on the client
> machine. However, a threshold could be given on the input size (an
> estimation) to determine whether to prefer fetch over MR jobs, similar to
> what Hive's '{{hive.fetch.task.conversion.threshold}}' does. (through Pig's
> LoadMetadata#getStatistics ?)

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)