[jira] [Commented] (PIG-3642) Direct HDFS access for small jobs (fetch)

Cheolsoo Park (JIRA) Mon, 20 Jan 2014 03:23:02 -0800

    [ 
https://issues.apache.org/jira/browse/PIG-3642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13876352#comment-13876352
 ]


Cheolsoo Park commented on PIG-3642:
------------------------------------

[~lbendig], you're right. I was reviewing Aniket's patch today and realized 
that these two patches are fairly independent of each other.

[~aniket486], after understanding your patch better, I agree with Lorand 
regarding the complexity. Besides, it makes map-only jobs almost instant. I 
couldn't compare runtimes between PIG-3463 and PIG-3642 because the current 
patch for PIG-3463 didn't work for map-only jobs. However, I imagine PIG-3463 
would be considerably slower since it still launches local MR jobs, etc. So 
shall we commit both?

In fact, what really concerns me is that these optimizations make many tests 
run differently than before. For example, many e2e tests that currently run as 
MR jobs can now run as fetch jobs. That significantly changes our code 
coverage. So I'd like to explicitly disable these optimizations in all the 
existing e2e tests. It should be trivial to do via conf files. Do you agree?
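
A minimal sketch of such an override, assuming the opt.fetch property named in 
the issue description below is also honored when set in a properties file that 
the e2e harness passes to Pig. Only the fetch switch is shown; the equivalent 
switch for PIG-3463 is left out because its name isn't given in this thread.

```properties
# Hypothetical properties fragment for the existing e2e tests.
# Assumption: opt.fetch (from the PIG-3642 description) can be set here as
# well as via "set opt.fetch false;" inside a script.
opt.fetch=false
```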



> Direct HDFS access for small jobs (fetch) 
> ------------------------------------------
>
>                 Key: PIG-3642
>                 URL: https://issues.apache.org/jira/browse/PIG-3642
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Lorand Bendig
>            Assignee: Lorand Bendig
>             Fix For: 0.13.0
>
>         Attachments: PIG-3642-3.patch, PIG-3642-4.patch, PIG-3642.patch
>
>
> With this patch I'd like to add the possibility to directly read data from 
> HDFS instead of launching MR jobs in case of simple (map-only) tasks. Hive 
> already has this feature (fetch). This patch shares some similarities with 
> the local mode of Pig 0.6. Here, fetching kicks off when the following holds 
> for a script:
> * it contains only LIMIT, FILTER, UNION (if no split is generated), STREAM, 
> (nested) FOREACH with expression operators, custom UDFs, etc.
> * no scalar aliases
> * no SampleLoader
> * single leaf job
> * DUMP (no STORE)
> The feature is enabled by default and can be toggled with:
> * -N or -no_fetch 
> * set opt.fetch true/false; 
> There's no STORE support because I wanted to make it explicit that this 
> "optimization" is for launching small/simple scripts during development, 
> rather than querying and filtering a large number of rows on the client 
> machine. However, a threshold could be given on the input size (an 
> estimation) to determine whether to prefer fetch over MR jobs, similar to 
> what Hive's 'hive.fetch.task.conversion.threshold' does (through Pig's 
> LoadMetadata#getStatistics?).



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
