[jira] [Commented] (DRILL-8023) Empty dict page breaks the "old" Parquet reader

ASF GitHub Bot (Jira) Tue, 18 Jan 2022 07:08:05 -0800


    [ 
https://issues.apache.org/jira/browse/DRILL-8023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17477958#comment-17477958
 ]


ASF GitHub Bot commented on DRILL-8023:
---------------------------------------

lgtm-com[bot] commented on pull request #2430:
URL: https://github.com/apache/drill/pull/2430#issuecomment-1015503422


   This pull request **introduces 1 alert** when merging 
1781dfc3b444c07eb6d596a69628732927fedc17 into 
55e94c4e1c4a05ac7010391daea8f4f0804b0286 - [view on 
LGTM.com](https://lgtm.com/projects/g/apache/drill/rev/pr-09aabdde3d672d39317d77c0374f1eb95c0cbf67)
   
   **new alerts:**
   
   * 1 for Dereferenced variable may be null


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.



> Empty dict page breaks the "old" Parquet reader
> -----------------------------------------------
>
>                 Key: DRILL-8023
>                 URL: https://issues.apache.org/jira/browse/DRILL-8023
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Parquet
>            Reporter: Alex Delgado
>            Assignee: James Turton
>            Priority: Major
>         Attachments: fastparquet_test.parquet.tar.gz, 
> pyarrow_test.parquet.tar.gz
>
>
> If the Python libraries dask and pyarrow are used to export a dataframe to
> Parquet and the resulting file has a column that is entirely null, Apache
> Drill fails to read it with "INTERNAL_ERROR ERROR: null". Dask and Spark
> can both read the same dask+pyarrow Parquet files.
>  
> Example:
> Create the Parquet files in Python, once with the pyarrow engine and once
> with the fastparquet engine.
> {code:python}
> import pandas as pd
> import dask.dataframe as dd
>
> # Column C contains only nulls; it is the column that triggers the error.
> df = pd.DataFrame(
>     {
>         'A': [1, 2, 3],
>         'B': ['a', 'b', 'c'],
>         'C': [None, None, None]
>     }
> )
> # Write the same data with each engine so the results can be compared.
> ddf = dd.from_pandas(df, npartitions=1)
> ddf.to_parquet('data/pyarrow_test.parquet', engine='pyarrow')
> ddf.to_parquet('data/fastparquet_test.parquet', engine='fastparquet')
> {code}
> Read these Parquet files with Drill:
> {code:java}
> Apache Drill 1.19.0
> "Everything is easier with Drill."
> apache drill> SELECT * FROM dfs.`data/fastparquet_test.parquet`;
> +---------------------+---+---+------+
> | __null_dask_index__ | A | B |  C   |
> +---------------------+---+---+------+
> | 0                   | 1 | a | null |
> | 1                   | 2 | b | null |
> | 2                   | 3 | c | null |
> +---------------------+---+---+------+
> 3 rows selected (0.179 seconds)
> apache drill> SELECT * FROM dfs.`data/pyarrow_test.parquet`;
> Error: INTERNAL_ERROR ERROR: null
> Fragment: 0:0
> Please, refer to logs for more information.
> [Error Id: 25034075-69b0-415e-8bb2-d7aa3d834653 on 75a796902ffe:31010] (state=,code=0)
> {code}
> Narrow down to the column that is causing the issue:
> {code:java}
> apache drill> SELECT A, B FROM dfs.`data/pyarrow_test.parquet`;
> +---+---+
> | A | B |
> +---+---+
> | 1 | a |
> | 2 | b |
> | 3 | c |
> +---+---+
> 3 rows selected (0.145 seconds)
> apache drill> SELECT C FROM dfs.`data/pyarrow_test.parquet`;
> Error: INTERNAL_ERROR ERROR: null
> Fragment: 0:0
> Please, refer to logs for more information.
> [Error Id: 932ef1d1-7c56-4833-b906-0da0c7c155f9 on 75a796902ffe:31010] (state=,code=0)
> {code}
> Dependency versions:
> {code:java}
> Apache Drill 1.19.0
> Python 3.9.7
> dask==2021.10.0
> pyarrow==6.0.0
> fastparquet==0.7.1
> {code}
> Attached are the parquet files I tested with.
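
Since the report ties the failure to a column that is entirely null, a quick pre-export check can show which columns of a dataset are likely to be affected. The following is only a minimal sketch using plain Python: `all_null_columns` is a hypothetical helper, not part of dask, pyarrow, or Drill, and it assumes the data is available as a mapping from column name to a list of values.

```python
from typing import Any, Dict, List


def all_null_columns(columns: Dict[str, List[Any]]) -> List[str]:
    """Return the names of columns whose values are all None.

    Hypothetical helper for illustration only. Note that a column with
    no values at all is also reported, because all() of an empty
    sequence is True.
    """
    return [
        name
        for name, values in columns.items()
        if all(value is None for value in values)
    ]


# Same data as in the report: only column C is entirely null.
data = {
    "A": [1, 2, 3],
    "B": ["a", "b", "c"],
    "C": [None, None, None],
}
print(all_null_columns(data))  # prints ['C']
```

With a pandas dataframe, converting via `df.to_dict(orient='list')` should give a compatible mapping, though values that pandas stores as NaN rather than None would need a separate check.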



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
