[jira] [Commented] (DRILL-8139) Data corruption and occasional segfaults querying Parquet/gzip under the async column reader and sync page reader

ASF GitHub Bot (Jira) Thu, 17 Feb 2022 00:50:04 -0800


    [ 
https://issues.apache.org/jira/browse/DRILL-8139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17493775#comment-17493775
 ]


ASF GitHub Bot commented on DRILL-8139:
---------------------------------------

jnturton opened a new pull request #2463:
URL: https://github.com/apache/drill/pull/2463


   # [DRILL-8139](https://issues.apache.org/jira/browse/DRILL-8139): Data 
corruption and occasional segfaults querying Parquet/gzip under the async 
column reader and sync page reader
   
   ## Description
   
   The gzip codec objects returned by the Parquet library's codec factory are 
not thread safe.  This PR works around the problem by creating, and later 
releasing, single-use codec factories for gzip.  As a result, many codec 
factories can be created while reading a Parquet file that contains 
gzip-compressed column data.  That is unnatural and unfortunate, but [the added 
overhead does appear to be 
small](https://github.com/apache/parquet-mr/blob/apache-parquet-1.12.2/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/DirectCodecFactory.java).
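   
   As a rough illustration of the single-use idea only, the sketch below 
gives every decompression call its own short-lived decompressor and releases it 
when the call returns, so no codec state is ever shared between threads.  It 
uses the JDK's `java.util.zip` classes as a stand-in and is not the Parquet or 
Drill API; the class `SingleUseGzipDemo` and its methods are hypothetical names 
introduced for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class: a JDK-only stand-in for the single-use codec pattern.
public class SingleUseGzipDemo {

  // Compress with a compressor that lives only for this call.
  public static byte[] compress(byte[] raw) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
      gz.write(raw);
    }
    return out.toByteArray();
  }

  // Decompress with a decompressor created here and released on return,
  // so concurrent callers never touch the same decompressor state.
  public static byte[] decompress(byte[] compressed) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    try (GZIPInputStream gz =
        new GZIPInputStream(new ByteArrayInputStream(compressed))) {
      byte[] buf = new byte[4096];
      int n;
      while ((n = gz.read(buf)) != -1) {
        out.write(buf, 0, n);
      }
    }
    return out.toByteArray();
  }

  // Run many round trips concurrently and count those that return intact data.
  public static int countCleanRoundTrips(int tasks, int threads) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Future<Boolean>> results = new ArrayList<>();
      for (int i = 0; i < tasks; i++) {
        final byte[] page = new byte[4096];
        Arrays.fill(page, (byte) i);
        Callable<Boolean> task = () -> Arrays.equals(decompress(compress(page)), page);
        results.add(pool.submit(task));
      }
      int clean = 0;
      for (Future<Boolean> result : results) {
        if (result.get()) {
          clean++;
        }
      }
      return clean;
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println(countCleanRoundTrips(64, 8) + " of 64 round trips were clean");
  }
}
```

The trade-off matches the one described above: an object is allocated and 
released per use, in exchange for never sharing mutable codec state across 
reader threads.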
   
   Note: currently this PR is rebased onto #2460 since that is required for a 
clean test run.
   
   ## Documentation
   N/A
   
   ## Testing
   
   TestParquetWriter#testTPCHReadWriteDictGzip
   Manual testing, especially under the async column reader.
   A unit test that uses the async column reader is currently not possible 
because of DRILL-8138.




> Data corruption and occasional segfaults querying Parquet/gzip under the 
> async column reader and sync page reader
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: DRILL-8139
>                 URL: https://issues.apache.org/jira/browse/DRILL-8139
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Parquet
>    Affects Versions: 1.19.0
>            Reporter: James Turton
>            Priority: Blocker
>
> In previously released versions of Drill, back to at least 1.17, this bug 
> only appears under the combination of the async column reader and the _sync_ 
> page reader, as per the reproduction script below.  In master, the bug 
> appears under the async column reader and both the sync and async page 
> readers.
> set `store.parquet.compression` = 'gzip';
> drop table if exists dfs.tmp.m;
> create table dfs.tmp.m as select * from cp.`tpch/supplier.parquet`;
> set `store.parquet.reader.pagereader.async` = false;
> set `store.parquet.reader.columnreader.async` = true;
> select * from dfs.tmp.m order by s_suppkey; -- repeat this last query and 
> watch the returned data.  Eventually you will also see failed queries or JVM 
> crashes



