[ 
https://issues.apache.org/jira/browse/ARROW-8287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17086534#comment-17086534
 ] 

Mark Hildreth commented on ARROW-8287:
--------------------------------------

I took this PR hoping it would be a simple intro, but there's actually a bit 
more here than meets the eye. Here are my notes.
 * If the utility methods were moved as-is to the arrow crate, then the public 
interface of the arrow crate would now include the *prettytable* crate's 
*Table* struct (as that is what *create_table* returns). The simplest fix is to 
make *create_table* private, and only expose *print_batches* for now, which is 
what I would recommend.
 * Second, the crate used to create the strings used in the output 
(*prettytable*) has a dependency on the crate *encode_unicode*. The 
*encode_unicode* crate does some funky stuff by implementing the trait 
*FromIterator* for *Vec<u8>*. This can break any code that uses the *arrow* 
crate and relies on there being only one way to collect an *Iterator<_>* into 
*Vec<u8>*, and it actually [broke some code in a test in the parquet 
crate.|https://github.com/apache/arrow/blob/8648cd46fd990e5c2e76c265b6f927b84a194ffb/rust/parquet/src/encodings/rle.rs#L832-L833]
 This was a pretty complicated problem for someone of my Rust experience, so I 
wrote up more information about it in [this reddit 
thread|https://www.reddit.com/r/rust/comments/g3iqan/crates_implementing_fromiterator_for_std/].


{code:java}
error[E0282]: type annotations needed
   --> parquet/src/encodings/rle.rs:833:26
    |
833 |                 Standard.sample_iter(&mut rng).take(seed_len).collect();
    |                          ^^^^^^^^^^^ cannot infer type for `T`
 {code}
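
To make the failure mode concrete, here is a minimal sketch that reproduces the same kind of ambiguity using only the standard library. The names *Marker*, *counter* and *seed_bytes* are hypothetical: *Marker* stands in for the item type that *encode_unicode* collects into *Vec<u8>*, and *counter* stands in for a generic source such as *Standard.sample_iter(..)*. It is an illustration under those assumptions, not the code from either crate.

```rust
use std::iter::FromIterator;

// Hypothetical local type, standing in for the item type that
// encode_unicode can collect into Vec<u8>.
struct Marker(u8);

impl From<u8> for Marker {
    fn from(b: u8) -> Self {
        Marker(b)
    }
}

// A second way to build a Vec<u8> from an iterator. The orphan rules
// allow this impl because `Marker` is local to this crate.
impl FromIterator<Marker> for Vec<u8> {
    fn from_iter<I: IntoIterator<Item = Marker>>(iter: I) -> Self {
        iter.into_iter().map(|m| m.0).collect()
    }
}

// Generic source whose item type has to be inferred by the caller.
fn counter<T: From<u8>>() -> impl Iterator<Item = T> {
    (0u8..=255).map(T::from)
}

fn seed_bytes(len: usize) -> Vec<u8> {
    // With only std's impl in scope, `counter().take(len).collect()`
    // would infer T = u8 from the Vec<u8> return type. With the extra
    // impl above, T could be u8 or Marker, so inference fails and the
    // item type has to be spelled out.
    counter::<u8>().take(len).collect()
}

fn main() {
    println!("{:?}", seed_bytes(4));
}
```

Adding the annotation at the call site is the "just add the required type annotations" option; it fixes this one spot, but any downstream crate relying on the same inference would hit the same error.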
 * Additionally, the interface for *print_batches* accepts a vector of multiple 
RecordBatches. Unfortunately, there is no static guarantee that the 
RecordBatches have the same schema. The C++/Python and JavaScript 
implementations have created a new logical type called "Table" which tries to 
provide this (although some of their APIs also don't seem to provide that 
guarantee). However, developing such a structure is way outside the scope 
of this project, so I would be happy to forget about it for now and perhaps add 
an issue to revisit this. As a short-term solution, *print_batches* could take a 
generic iterator of *RecordBatch* values, which probably wouldn't need to 
change if we did end up with a *Table* type later on.
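
As a rough sketch of what the first and third notes could look like together: the table-building helper stays private, and the public function accepts any iterator of batch references and checks the schemas at run time. *Schema* and *RecordBatch* below are simplified stand-ins so the example compiles on its own; they are not the arrow crate's types, and *format_batches* is a hypothetical helper added only to make the output easy to inspect.

```rust
// Simplified stand-ins for the real arrow types (assumptions for this sketch).
#[derive(Debug, Clone, PartialEq)]
pub struct Schema {
    pub fields: Vec<String>,
}

#[derive(Debug, Clone)]
pub struct RecordBatch {
    pub schema: Schema,
    pub rows: Vec<Vec<String>>,
}

pub mod pretty {
    use super::RecordBatch;

    // Private: the rendering type never appears in the public interface,
    // so the formatting crate behind it can be swapped out later.
    struct Table {
        lines: Vec<String>,
    }

    fn create_table<'a, I>(batches: I) -> Result<Table, String>
    where
        I: IntoIterator<Item = &'a RecordBatch>,
    {
        let mut iter = batches.into_iter();
        let first = match iter.next() {
            Some(batch) => batch,
            None => return Ok(Table { lines: Vec::new() }),
        };
        let mut lines = vec![first.schema.fields.join(" | ")];
        for batch in std::iter::once(first).chain(iter) {
            // There is no static guarantee, so compare schemas at run time.
            if batch.schema != first.schema {
                return Err("record batches have different schemas".to_string());
            }
            lines.extend(batch.rows.iter().map(|row| row.join(" | ")));
        }
        Ok(Table { lines })
    }

    /// Hypothetical helper: renders the batches to a string.
    pub fn format_batches<'a, I>(batches: I) -> Result<String, String>
    where
        I: IntoIterator<Item = &'a RecordBatch>,
    {
        Ok(create_table(batches)?.lines.join("\n"))
    }

    /// Public entry point: prints the batches to stdout.
    pub fn print_batches<'a, I>(batches: I) -> Result<(), String>
    where
        I: IntoIterator<Item = &'a RecordBatch>,
    {
        println!("{}", format_batches(batches)?);
        Ok(())
    }
}

fn main() {
    let schema = Schema { fields: vec!["id".to_string(), "name".to_string()] };
    let batch = RecordBatch {
        schema,
        rows: vec![vec!["1".to_string(), "a".to_string()]],
    };
    // A slice, a Vec, or any other iterator of references works here.
    pretty::print_batches(&[batch.clone(), batch]).unwrap();
}
```

Because the parameter is only bounded by *IntoIterator*, a later *Table* type could implement that trait and be passed in without changing the signature.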

 
So, here are my blocking questions:
 * Stick with the original prettytable crate and just add the required type 
annotations in the Parquet test, or find another crate that doesn't have said 
side effect? I recommend finding a different one.
 * Keep *create_table* public, or make it private? I recommend making it private.
 * Come up with a better wrapper for a "Table" to enforce one schema across 
multiple record batches, or don't worry about this for now? My recommendation 
is not to worry about it for now, but to make *print_batches* accept an 
iterator and to add an issue to think more about creating a *Table* type like 
the other implementations do.

> [Rust] Arrow examples should use utility to print results
> ---------------------------------------------------------
>
>                 Key: ARROW-8287
>                 URL: https://issues.apache.org/jira/browse/ARROW-8287
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Rust
>            Reporter: Andy Grove
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.0.0
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> [https://github.com/apache/arrow/pull/6773] added a utility for printing 
> record batches and the DataFusion examples were updated to use this. We 
> should now do the same for the Arrow examples. This will require moving the 
> utility method from the datafusion crate to the arrow crate.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
