[jira] [Commented] (PIG-3642) Direct HDFS access for small jobs (fetch)

Cheolsoo Park (JIRA) Sun, 19 Jan 2014 12:16:59 -0800

    [ https://issues.apache.org/jira/browse/PIG-3642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13875995#comment-13875995 ]


Cheolsoo Park commented on PIG-3642:
------------------------------------

I will leave the decision to Aniket and Lorand.

Just FYI: I have been running the e2e tests, and I found many failures even 
without this patch, so it was hard to tell whether this patch breaks any 
tests.

Here is the result on the current trunk (without this patch):
{code}
[exec] Final results ,    PASSED: 536  FAILED: 22   SKIPPED: 24   ABORTED: 62   FAILED DEPENDENCY: 0
{code}
We should fix these before things get worse.

> Direct HDFS access for small jobs (fetch) 
> ------------------------------------------
>
>                 Key: PIG-3642
>                 URL: https://issues.apache.org/jira/browse/PIG-3642
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Lorand Bendig
>            Assignee: Lorand Bendig
>             Fix For: 0.13.0
>
>         Attachments: PIG-3642-3.patch, PIG-3642-4.patch, PIG-3642.patch
>
>
> With this patch I'd like to add the possibility to directly read data from 
> HDFS instead of launching MR jobs in case of simple (map-only) tasks. Hive 
> already has this feature (fetch). This patch shares some similarities with 
> the local mode of Pig 0.6. Here, fetching kicks off when the following holds 
> for a script:
> * it contains only LIMIT, FILTER, UNION (if no split is generated), STREAM, 
> (nested) FOREACH with expression operators, custom UDFs, etc.
> * no scalar aliases
> * no SampleLoader
> * single leaf job
> * DUMP (no STORE)
> The feature is enabled by default and can be toggled with:
> * -N or -no_fetch 
> * set opt.fetch true/false; 
> There's no STORE support because I wanted to make it explicit that this 
> "optimization" is for launching small/simple scripts during development, 
> rather than for querying and filtering a large number of rows on the client 
> machine. However, a threshold could be given on the (estimated) input size 
> to determine whether to prefer fetch over MR jobs, similar to what Hive's 
> '{{hive.fetch.task.conversion.threshold}}' does (through Pig's 
> LoadMetadata#getStatistics?).
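
The eligibility conditions quoted above can be read as a single predicate over 
the script's plan. The following is only a minimal sketch of that reading, not 
the code in the attached patches: ScriptPlan, FETCHABLE_OPS, is_fetchable and 
all field names are hypothetical, and treating LOAD as always allowed is an 
assumption, since the description does not list it.

```python
from dataclasses import dataclass

# Operators the issue description lists as fetch-compatible.
# LOAD is an assumption of this sketch: every script has to read its input.
FETCHABLE_OPS = {"LOAD", "LIMIT", "FILTER", "UNION", "STREAM", "FOREACH"}


@dataclass
class ScriptPlan:
    """Hypothetical summary of a Pig script, used only for illustration."""
    operators: list
    union_generates_split: bool = False
    uses_scalar_alias: bool = False
    uses_sample_loader: bool = False
    leaf_count: int = 1
    sink: str = "DUMP"          # "DUMP" or "STORE"
    fetch_enabled: bool = True  # stands in for "set opt.fetch true/false"


def is_fetchable(plan: ScriptPlan) -> bool:
    """Return True if every condition from the description holds."""
    if not plan.fetch_enabled:
        return False
    # Any operator outside the listed set forces a regular MR job.
    if any(op not in FETCHABLE_OPS for op in plan.operators):
        return False
    # UNION only qualifies when no split is generated.
    if "UNION" in plan.operators and plan.union_generates_split:
        return False
    return (
        not plan.uses_scalar_alias
        and not plan.uses_sample_loader
        and plan.leaf_count == 1
        and plan.sink == "DUMP"
    )
```

Under these assumptions, a LOAD/FILTER/LIMIT script ending in DUMP qualifies, 
while the same script ending in STORE, or one containing a GROUP, falls back to 
MR jobs.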



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
