[ 
https://issues.apache.org/jira/browse/ARROW-12873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17352013#comment-17352013
 ] 

Weston Pace commented on ARROW-12873:
-------------------------------------

I agree we will need metadata to travel with the batches.  However, like 
[~apitrou], I'm not sufficiently convinced that we can't know it all ahead of 
time.  You mention "since they may not originate from the arrow library".  Do 
you have an example of that?  I think `void*` is clearly justified if it is 
pass-through information, in other words if A) the source is external to Arrow 
AND B) the consumption / use of the metadata is external to Arrow.  I'm not 
sure that is the case here.

My only concern with void* is that it communicates nothing about what 
information should be put in there.  For example, if Arrow is going to use this 
information to optimize query plans then it would be useful for data producers 
to know exactly what format they should be creating so that they can have their 
data optimized appropriately.  Abstractions can be built up, simplified, and 
refined over time.

With regard to filtering, we already have the concept of expressions and 
simplification.  It seems pretty straightforward that an exec batch would have 
a partition expression associated with it.  If "partition expression" is not 
the right term, then maybe "guarantee expression" or something of the sort.
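
A minimal sketch of the idea, using toy stand-in types rather than Arrow's real 
ExecBatch and Expression classes.  The names TaggedBatch, Guarantee, 
EqualsFilter and SimplifyAgainst are hypothetical and exist only for this 
illustration; the guarantee is assumed to be a conjunction of "column == value" 
facts that hold for every row in the batch.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical stand-in for a guarantee expression: every row in the batch
// satisfies "column == value" for each entry in this map.
struct Guarantee {
  std::map<std::string, int64_t> equals;
};

// Hypothetical stand-in for an exec batch that carries its guarantee with it.
// A real batch would also hold the column values.
struct TaggedBatch {
  int64_t length;
  Guarantee guarantee;
};

// A toy filter of the form "column == value".
struct EqualsFilter {
  std::string column;
  int64_t value;
};

enum class Simplified { kAlwaysTrue, kAlwaysFalse, kUnknown };

// Decide what the guarantee alone tells us about the filter.
// kAlwaysTrue:  every row passes, so the filter can be skipped.
// kAlwaysFalse: no row passes, so the whole batch can be dropped.
// kUnknown:     the guarantee says nothing; evaluate the filter row by row.
Simplified SimplifyAgainst(const EqualsFilter& filter, const Guarantee& g) {
  auto it = g.equals.find(filter.column);
  if (it == g.equals.end()) return Simplified::kUnknown;
  return it->second == filter.value ? Simplified::kAlwaysTrue
                                    : Simplified::kAlwaysFalse;
}
```

With something like this, a filter node only needs to inspect the batch's 
guarantee to decide whether it can skip the batch or pass it through untouched, 
and a producer knows exactly which format to emit to benefit from that.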

 

> [C++][Compute] Support tagging ExecBatches with arbitrary extra information
> ---------------------------------------------------------------------------
>
>                 Key: ARROW-12873
>                 URL: https://issues.apache.org/jira/browse/ARROW-12873
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Ben Kietzman
>            Priority: Major
>
> Ideally, ExecBatches could be tagged with arbitrary optional objects for 
> tracing purposes and to transmit execution hints from one ExecNode to another.
> These should *not* be explicit members like ExecBatch::selection_vector is, 
> since they may not originate from the arrow library. For an example within 
> the arrow project: {{libarrow_dataset}} will be used to produce ScanNodes and 
> WriteNodes and it's useful to tag scanned batches with their {{Fragment}} 
> of origin. However, adding {{ExecBatch::fragment}} would result in a cyclic 
> dependency.
> To facilitate this tagging capability, we would need a type-erased container, 
> something like
> {code}
> struct AnySet {
>   void* Get(tag_t tag);
>   void Set(tag_t tag, void* value, FnOnce<void(void*)> destructor);
> };
> {code}
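
For illustration only, the container proposed in the issue description could be 
fleshed out roughly as follows.  This is a sketch under several assumptions: 
tag_t is taken to be an opaque pointer, std::function stands in for Arrow's 
FnOnce, and the Tag, SetTyped and GetTyped helpers are hypothetical additions 
that show one way to recover some type information on top of `void*`.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

using tag_t = const void*;  // assumption: a tag is any unique address

class AnySet {
 public:
  AnySet() = default;
  AnySet(const AnySet&) = delete;  // values are owned, so copying is disabled
  AnySet& operator=(const AnySet&) = delete;

  ~AnySet() {
    for (auto& kv : entries_) Destroy(kv.second);
  }

  // Returns the stored value, or nullptr if nothing was set for this tag.
  void* Get(tag_t tag) const {
    auto it = entries_.find(tag);
    return it == entries_.end() ? nullptr : it->second.value;
  }

  // Takes ownership of value; any previous value for the tag is destroyed.
  void Set(tag_t tag, void* value, std::function<void(void*)> destructor) {
    auto it = entries_.find(tag);
    if (it != entries_.end()) {
      Destroy(it->second);
      it->second = Entry{value, std::move(destructor)};
    } else {
      entries_.emplace(tag, Entry{value, std::move(destructor)});
    }
  }

 private:
  struct Entry {
    void* value;
    std::function<void(void*)> destructor;
  };

  static void Destroy(Entry& entry) {
    if (entry.destructor) entry.destructor(entry.value);
  }

  std::unordered_map<tag_t, Entry> entries_;
};

// Hypothetical typed layer: one static object per type, whose address is the tag.
template <typename T>
struct Tag {
  static const char id;
};
template <typename T>
const char Tag<T>::id = 0;

template <typename T>
void SetTyped(AnySet& set, T value) {
  set.Set(&Tag<T>::id, new T(std::move(value)),
          [](void* p) { delete static_cast<T*>(p); });
}

template <typename T>
T* GetTyped(const AnySet& set) {
  return static_cast<T*>(set.Get(&Tag<T>::id));
}
```

The typed helpers keep the casts in one place, but the container itself still 
says nothing about which tags a consumer should expect to find.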



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
