Re: Pig developer meeting in February

Dmitriy Ryaboy Thu, 27 Jan 2011 16:16:05 -0800

Ashutosh, where do we do that? I thought we did, too, but didn't find it
last time I looked. LoadPushDown has this:


    /**
     * Set of possible operations that Pig can push down to a loader.
     */
    enum OperatorSet {PROJECTION};


There is also this in LoadMetadata, but it is pretty explicit in the
comments about this being partition-specific. Are you saying that as long as
one claims every column as a partition, all filters will be pushed down?
Will the filters also be applied to the data the loader returns, even if the
loader accepts the expression? That would be useful for loaders that have
the ability to apply probabilistic filters, for example.

    /**
     * Find what columns are partition keys for this input.
     * @param location Location as returned by
     * {@link LoadFunc#relativeToAbsolutePath(String, org.apache.hadoop.fs.Path)}
     * @param job The {@link Job} object - this should be used only to obtain
     * cluster properties through {@link Job#getConfiguration()} and not to set/query
     * any runtime job information.
     * @return array of field names of the partition keys. Implementations
     * should return null to indicate that there are no partition keys
     * @throws IOException if an exception occurs while retrieving partition keys
     */
    String[] getPartitionKeys(String location, Job job)
    throws IOException;


    /**
     * Set the filter for partitioning.  It is assumed that this filter
     * will only contain references to fields given as partition keys in
     * getPartitionKeys. So if the implementation returns null in
     * {@link #getPartitionKeys(String, Job)}, then this method is not
     * called by Pig runtime. This method is also not called by the Pig runtime
     * if there are no partition filter conditions.
     * @param partitionFilter that describes filter for partitioning
     * @throws IOException if the filter is not compatible with the storage
     * mechanism or contains non-partition fields.
     */
    void setPartitionFilter(Expression partitionFilter) throws IOException;
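
To make the probabilistic-filter question concrete, here is a minimal,
self-contained sketch. Nothing in it is Pig API: ApproximatePushdownSketch,
TinyBloomFilter, loadWithApproximateFilter and reapplyExact are hypothetical
names made up for illustration. It assumes a loader that can only apply an
approximate filter (one that may let extra rows through but never drops a
matching row), with the engine re-applying the exact predicate afterwards.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch; none of these types exist in Pig. */
final class ApproximatePushdownSketch {

    /** Stand-in for a storage-side probabilistic filter (a tiny Bloom filter). */
    static final class TinyBloomFilter {
        private final BitSet bits;
        private final int size;

        TinyBloomFilter(int size) {
            this.size = size;
            this.bits = new BitSet(size);
        }

        private int h1(String s) { return Math.floorMod(s.hashCode(), size); }
        private int h2(String s) { return Math.floorMod(s.hashCode() * 31 + 17, size); }

        void add(String key) {
            bits.set(h1(key));
            bits.set(h2(key));
        }

        /** May return true for keys never added; never returns false for added keys. */
        boolean mightContain(String key) {
            return bits.get(h1(key)) && bits.get(h2(key));
        }
    }

    /** Loader side: drop only the rows the approximate filter rules out. */
    static List<String> loadWithApproximateFilter(List<String> rows, TinyBloomFilter filter) {
        List<String> out = new ArrayList<>();
        for (String row : rows) {
            if (filter.mightContain(row)) {
                out.add(row);
            }
        }
        return out;
    }

    /** Engine side: re-apply the exact predicate to whatever the loader returned. */
    static List<String> reapplyExact(List<String> rows, Set<String> wanted) {
        List<String> out = new ArrayList<>();
        for (String row : rows) {
            if (wanted.contains(row)) {
                out.add(row);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> rows = List.of("a", "b", "c", "d");
        Set<String> wanted = Set.of("b", "d");

        TinyBloomFilter filter = new TinyBloomFilter(8);
        for (String key : wanted) {
            filter.add(key);
        }

        List<String> loaded = loadWithApproximateFilter(rows, filter);
        System.out.println("loader returned: " + loaded);
        System.out.println("after exact filter: " + reapplyExact(loaded, wanted));
    }
}
```

If the filter were not re-applied after the loader accepts the expression,
any false positives from the loader would leak into the result, which is
why the question above matters for this kind of loader.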

On Thu, Jan 27, 2011 at 10:02 AM, Ashutosh Chauhan <hashut...@apache.org> wrote:

> What do you mean by true predicate pushdown? We hand over the full
> filter expression in that method to the loader. That, I guess, is
> sufficient info to push more processing to the storage layer, e.g. to do
> range queries in HBase. Pig doesn't have any more information about
> filters than that to push, unless you want the full logical plan.
>
> Ashutosh
> On Wed, Jan 26, 2011 at 18:04, Dmitriy Ryaboy <dvrya...@gmail.com> wrote:
> > Right, we do partition filtering, but not true predicate pushdown.
> >
> > On Wed, Jan 26, 2011 at 5:59 PM, Daniel Dai <jiany...@yahoo-inc.com> wrote:
> >
> >> Are you talking about LoadMetadata.setPartitionFilter?
> >> PartitionFilterOptimizer will do that.
> >>
> >> Daniel
> >>
> >>
> >> Dmitriy Ryaboy wrote:
> >>
> >>> I may be wrong, but I think predicate pushdown is designed for, but not
> >>> actually implemented in, the current LoadPushDown interface (you can only
> >>> push projections). If I am wrong, that's great, but if not, that would be
> >>> an important feature to add, as people are trying to connect Pig to "smart"
> >>> storage systems like RDBMSes, HBase, and Cassandra more and more. I think
> >>> we only kind of simulate this with partition key info, which is not always
> >>> sufficient.
> >>>
> >>> D
> >>>
> >>> On Wed, Jan 26, 2011 at 2:41 PM, Julien Le Dem <led...@yahoo-inc.com> wrote:
> >>>
> >>>
> >>>
> >>>> If making Pig thread safe (i.e., two threads each running a different Pig
> >>>> script) is important, then we need to change some of the APIs from static
> >>>> singleton access to a dependency injection pattern.
> >>>> In that case, this should probably be done before 1.0.
> >>>> For example: UDFContext should be passed to the UDF after construction
> >>>> (similar to the ServletContext in Servlet or the way Hadoop passes the
> >>>> context to tasks).
> >>>> Also, a clearly separated API that does not depend on the Pig
> >>>> implementation would help.
> >>>> For example, UDFContext is in org.apache.pig.impl.util when it would be
> >>>> better in org.apache.pig.api (or at least an interface defining it).
> >>>>
> >>>> Julien
> >>>>
> >>>> On 1/24/11 10:14 AM, "Olga Natkovich" <ol...@yahoo-inc.com> wrote:
> >>>>
> >>>> Hi Guys,
> >>>>
> >>>> I think it is time for us to have another meeting. Yahoo would be happy to
> >>>> host if this works for everybody. How about Wednesday, 2/9, 4-6 pm? Please
> >>>> let us know if you are planning to attend and if the date/time works for
> >>>> you.
> >>>>
> >>>> Things that come to mind to discuss (as always, feel free to suggest
> >>>> others):
> >>>>
> >>>> - Error handling proposal - this might be easier to finalize face-to-face
> >>>> - Pig 0.9 plan
> >>>> - Pig Roadmap beyond 0.9
> >>>>   o What do we want to do in Pig.next?
> >>>>   o Are we ready for Pig 1.0?
> >>>>
> >>>> Olga
> >>>>
> >>>>
> >>>>
> >>>>
> >>>
> >>
> >
>
