[jira] [Commented] (PIG-3642) Direct HDFS access for small jobs (fetch)

Cheolsoo Park (JIRA) Thu, 02 Jan 2014 09:51:03 -0800

    [ 
https://issues.apache.org/jira/browse/PIG-3642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860392#comment-13860392
 ]


Cheolsoo Park commented on PIG-3642:
------------------------------------

[~azaroth], thank you for raising a concern. But I still think we should commit 
this patch for the following reasons:

# Fetch optimization happens after the physical plan is fully built. If the plan 
is fetchable (i.e. meets all the conditions Lorand listed in the description), 
Pig will launch the job via FetchLauncher instead of MapReduceLauncher. Given 
this code path, I think the possibility of introducing a weird optimization bug 
is minimal. In addition, the optimization is only applicable to fairly small 
queries.
# There are indeed changes to some backend operators such as POStream. This is 
because the logic for when to pull data from the pipeline is different in some 
cases. But these changes are fairly minimal too.
# IMO, the benefit of this optimization is big. I am constantly asked by users 
about this feature. True, it won't improve the performance of production 
ETL jobs, but it will shorten development iteration. In addition, launching a 
full MR job for a simple load/dump query definitely makes a bad impression on 
new users.
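
To make the code path in the first point concrete, here is a minimal sketch of the launcher choice. It is only an illustrative model in Python, not code from the patch: the `PhysicalPlan` class, its `fetchable` flag, and the `choose_launcher` function are hypothetical names, and only FetchLauncher and MapReduceLauncher come from the comment above.

```python
from dataclasses import dataclass


@dataclass
class PhysicalPlan:
    """Hypothetical stand-in for a fully built physical plan."""
    # Assumed flag: True if the plan meets every condition in the description.
    fetchable: bool


def choose_launcher(plan: PhysicalPlan, fetch_enabled: bool = True) -> str:
    """Return the name of the launcher that would run the plan.

    The decision is made only after the plan is complete, so the plan
    itself is the same whichever launcher ends up running it.
    """
    if fetch_enabled and plan.fetchable:
        return "FetchLauncher"
    return "MapReduceLauncher"
```

In this model the check is a single branch taken after planning, which is the reason given above for expecting little risk of optimization bugs.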






> Direct HDFS access for small jobs (fetch) 
> ------------------------------------------
>
>                 Key: PIG-3642
>                 URL: https://issues.apache.org/jira/browse/PIG-3642
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Lorand Bendig
>            Assignee: Lorand Bendig
>             Fix For: 0.13.0
>
>         Attachments: PIG-3642.patch
>
>
> With this patch I'd like to add the possibility to directly read data from 
> HDFS instead of launching MR jobs in case of simple (map-only) tasks. Hive 
> already has this feature (fetch). This patch shares some similarities with 
> the local mode of Pig 0.6. Here, fetching kicks off when the following holds 
> for a script:
> * it contains only LIMIT, FILTER, UNION (if no split is generated), STREAM, 
> (nested) FOREACH with expression operators, custom UDFs, etc.
> * no scalar aliases
> * no SampleLoader
> * single leaf job
> * DUMP (no STORE)
> The feature is enabled by default and can be toggled with:
> * -N or -no_fetch 
> * set opt.fetch true/false; 
> There's no STORE support because I wanted to make it explicit that this 
> "optimization" is for launching small/simple scripts during development, 
> rather than querying and filtering a large number of rows on the client 
> machine. However, a threshold could be given on the input size (an 
> estimation) to determine whether to prefer fetch over MR jobs, similar to 
> what Hive's '{{hive.fetch.task.conversion.threshold}}' does (perhaps through 
> Pig's LoadMetadata#getStatistics?).
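
The conditions listed in the description, plus the proposed input-size threshold, can be sketched as a single predicate. This is a hedged illustration in Python rather than the patch's logic: `ScriptSummary`, `ALLOWED_OPERATORS`, and `is_fetchable` are hypothetical names, LOAD is assumed to be allowed because every script has to read its input, and skipping the threshold when no size estimate is available is an assumption, not something the description specifies.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional

# Operators named in the description; LOAD is an assumption (see lead-in).
ALLOWED_OPERATORS = frozenset(
    {"LOAD", "LIMIT", "FILTER", "UNION", "STREAM", "FOREACH"}
)


@dataclass
class ScriptSummary:
    """Hypothetical summary of a script; not a Pig class."""
    operators: FrozenSet[str]
    union_generates_split: bool = False
    has_scalar_aliases: bool = False
    uses_sample_loader: bool = False
    leaf_jobs: int = 1
    ends_with_dump: bool = True          # False means the script ends in STORE
    estimated_input_bytes: Optional[int] = None


def is_fetchable(script: ScriptSummary,
                 fetch_enabled: bool = True,
                 threshold_bytes: Optional[int] = None) -> bool:
    """Return True if the script meets every listed fetch condition."""
    if not fetch_enabled:                # toggled off (-N / opt.fetch false)
        return False
    if not script.operators <= ALLOWED_OPERATORS:
        return False
    if "UNION" in script.operators and script.union_generates_split:
        return False
    if script.has_scalar_aliases or script.uses_sample_loader:
        return False
    if script.leaf_jobs != 1 or not script.ends_with_dump:
        return False
    # Optional threshold: applied only when both a limit and an estimate exist.
    if threshold_bytes is not None and script.estimated_input_bytes is not None:
        return script.estimated_input_bytes <= threshold_bytes
    return True
```

For example, a script summarized as `ScriptSummary(operators=frozenset({"LOAD", "FILTER", "LIMIT"}))` passes the check, while one that contains any operator outside the allowed set, or that ends in STORE, falls back to a regular MR job.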



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
