[jira] [Commented] (ARROW-10226) [Rust] [Parquet] Parquet reader reading wrong columns in some batches within a parquet file

Andy Grove (Jira) Thu, 08 Oct 2020 10:50:25 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-10226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17210358#comment-17210358
 ]


Andy Grove commented on ARROW-10226:
------------------------------------

Here is a test case to reproduce the issue. I uploaded the parquet file to 
dropbox. It is ~100MB.

[https://www.dropbox.com/s/6cpz1h9juxl4c7t/part-00000-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet?dl=0]

[~jorgecarleitao] Thanks for the offer of help. I don't know how much time we 
should spend on this, but if you have time to take a look, at least to 
confirm that the test also fails for you, that would be an extra data point. 
{code:java}
#[test]
fn foo() {
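    // SerializedFileReader and ParquetFileArrowReader are assumed to be in
    // scope already from the enclosing test module.
    use arrow::array::{Array, StringArray};
    use crate::arrow::arrow_reader::ArrowReader;
    use std::rc::Rc;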

    let file = std::fs::File::open(
        "/mnt/tpch/debug/lineitem/part-00000-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet",
    )
    .unwrap();
    let file_reader = Rc::new(SerializedFileReader::new(file).unwrap());
    let metadata = file_reader
        .metadata
        .file_metadata()
        .key_value_metadata()
        .as_ref()
        .unwrap();


    let mut arrow_reader = ParquetFileArrowReader::new(file_reader);
    let schema = arrow_reader.get_schema().unwrap();
    let projection = vec![4, 5, 6, 7, 8, 9, 10];
    let mut batch_reader =
        arrow_reader.get_record_reader_by_columns(projection, 40960).unwrap();

    while let Some(batch) = batch_reader.next() {
        let batch = batch.unwrap();

        let mut n = 0;
        match batch.column(4).as_any().downcast_ref::<StringArray>() {
            Some(l_returnflag) => {
                for i in 0..batch.num_rows() {
                    if l_returnflag.is_valid(i) {
                        if l_returnflag.value(i).len() > 1 {
                            n = n + 1;
                        }
                    }
                }
            }
            None => println!("l_returnflag is not a string")
        }
        println!("{} bad values in batch", n);
        assert_eq!(n, 0);
    }
}
 {code}
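
The per-batch check in the test above can be separated from the Parquet reader for clarity. The following is a minimal sketch using only the standard library, with the assumption (from the TPC-H schema) that l_returnflag holds a single character, so any longer non-null value suggests that data from another column was read. The helper name count_bad_values and the sample values are hypothetical and are not taken from the file attached to this issue.

```rust
/// Counts the values that are present (non-null) and longer than one byte.
/// Assumption: l_returnflag is a single-character column, so a longer value
/// indicates that the reader returned data from a different column.
fn count_bad_values(values: &[Option<&str>]) -> usize {
    values
        .iter()
        .flatten() // skip nulls, mirroring the is_valid(i) check
        .filter(|s| s.len() > 1)
        .count()
}

fn main() {
    // Hypothetical sample batches for illustration only.
    let good = [Some("N"), Some("R"), None, Some("A")];
    let bad = [Some("N"), Some("DELIVER IN PERSON"), None, Some("TRUCK")];

    println!("{} bad values in batch", count_bad_values(&good));
    println!("{} bad values in batch", count_bad_values(&bad));
}
```

In the real test the same predicate would be applied to each StringArray returned by the record batch reader, and any batch with a non-zero count would fail the assertion.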

> [Rust] [Parquet] Parquet reader reading wrong columns in some batches within 
> a parquet file
> -------------------------------------------------------------------------------------------
>
>                 Key: ARROW-10226
>                 URL: https://issues.apache.org/jira/browse/ARROW-10226
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Rust, Rust - DataFusion
>            Reporter: Andy Grove
>            Assignee: Andy Grove
>            Priority: Major
>             Fix For: 2.0.0
>
>
> I re-installed my desktop a few days ago (now using Ubuntu 20.04 LTS) and 
> when I try to run the TPC-H benchmark, it never completes and eventually 
> uses up all 64 GB of RAM.
> I can run Spark against the data set and the query completes in 24 seconds, 
> which IIRC is how long it took before.
> It is possible that something is odd in my environment, but it is also 
> possible/likely that this is a real bug.
> I am investigating this and will update the Jira once I know more.
> I also went back to old commits that were working for me before and they show 
> the same issue so I don't think this is related to a recent code change.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
