[ 
https://issues.apache.org/jira/browse/ORC-991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17413868#comment-17413868
 ] 

Yiqun Zhang edited comment on ORC-991 at 9/13/21, 2:40 AM:
-----------------------------------------------------------

Hi [~dongjoon] 
Unexpectedly, this turned out to be a tricky problem: my fix makes old ORC files unreadable if they contain encrypted fields.
The reason is convoluted. There were two bugs:
1. the row index was written without encryption
2. the order of encrypted and unencrypted index and data streams was read incorrectly

Together, these two problems made encrypted fields appear to work under older 
versions, only to fail when a filter is pushed down into the read. Users may 
not be aware that the row index of an encrypted field was never actually 
encrypted.

If a user reads an old file with the version fixed by my PR, it will 
unfortunately crash whenever the file contains encrypted fields, because the 
new version assumes that the row index of an encrypted field is also encrypted.
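
To see why this surfaces as a crash rather than a clean error, note that AES/CTR is a stream cipher: "decrypting" bytes that were never encrypted does not fail, it silently produces garbage. A minimal illustration (plain JDK crypto, not ORC code; the key and index bytes are made up):

```java
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class CtrGarbleDemo {
    public static void main(String[] args) throws Exception {
        // Hypothetical 128-bit key and zero IV, for illustration only.
        byte[] key = "0123456789abcdef".getBytes(StandardCharsets.UTF_8);
        Cipher cipher = Cipher.getInstance("AES/CTR/NoPadding");
        cipher.init(Cipher.DECRYPT_MODE,
                    new SecretKeySpec(key, "AES"),
                    new IvParameterSpec(new byte[16]));

        // Pretend this is a row index that an old writer stored in plaintext.
        byte[] plaintextIndex = "ROW_INDEX_BYTES".getBytes(StandardCharsets.UTF_8);

        // A reader that assumes the index is encrypted decrypts it anyway.
        // CTR mode cannot detect the mistake: no exception, just garbage.
        byte[] garbled = cipher.doFinal(plaintextIndex);

        System.out.println("decryption threw: no");
        System.out.println("bytes garbled: " + !Arrays.equals(garbled, plaintextIndex));
        // The damage only surfaces later, when parsing the garbled index
        // fails (here: the AssertionError "Index is not populated").
    }
}
```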

I think we may need to provide a detection mechanism for compatible reads; 
that is the cost of this bug. The convert tool could also rewrite old files 
to help with batch fixes. 
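
One possible shape for such a compatibility check (a hypothetical sketch, not the actual ORC fix; `validateIndex` and the `"IDX"` magic stand in for a real protobuf parse of the RowIndex message): try the index as encrypted first, and fall back to plaintext if the decrypted bytes do not validate.

```java
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class CompatIndexReader {
    private static final byte[] MAGIC = "IDX".getBytes(StandardCharsets.UTF_8);

    // Stand-in for real index validation (in ORC this would be a
    // successful protobuf parse of the RowIndex message).
    static boolean validateIndex(byte[] bytes) {
        return bytes.length >= MAGIC.length
            && Arrays.equals(Arrays.copyOf(bytes, MAGIC.length), MAGIC);
    }

    static byte[] decrypt(byte[] raw, byte[] key) throws Exception {
        Cipher c = Cipher.getInstance("AES/CTR/NoPadding");
        c.init(Cipher.DECRYPT_MODE, new SecretKeySpec(key, "AES"),
               new IvParameterSpec(new byte[16]));
        return c.doFinal(raw);
    }

    // New files carry an encrypted index; old files a plaintext one.
    static byte[] readIndex(byte[] raw, byte[] key) throws Exception {
        byte[] decrypted = decrypt(raw, key);
        if (validateIndex(decrypted)) return decrypted;  // fixed writer
        if (validateIndex(raw)) return raw;              // pre-fix writer
        throw new IllegalStateException("row index is corrupt");
    }

    public static void main(String[] args) throws Exception {
        byte[] key = "0123456789abcdef".getBytes(StandardCharsets.UTF_8);
        byte[] oldFileIndex = "IDXsome-entries".getBytes(StandardCharsets.UTF_8);
        // An old, plaintext index is still read correctly:
        System.out.println(new String(readIndex(oldFileIndex, key),
                                      StandardCharsets.UTF_8));
    }
}
```

The fallback costs one extra validation attempt per index, and a false positive (garbage that happens to validate) is as unlikely as the validation check is strict.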

So I reopened the issue.



> encrypted data throws an exception with a SQL filter push down
> --------------------------------------------------------------
>
>                 Key: ORC-991
>                 URL: https://issues.apache.org/jira/browse/ORC-991
>             Project: ORC
>          Issue Type: Bug
>          Components: Java
>    Affects Versions: 1.7.0, 1.6.8, 1.6.9, 1.6.10
>         Environment: 1.ORC 1.6.8+
> 2.SparkSQL 2.4.7
> 3.JDK 1.8
>            Reporter: hgs
>            Assignee: Yiqun Zhang
>            Priority: Blocker
>             Fix For: 1.7.0, 1.6.11
>
>         Attachments: files.zip
>
>
> 1. create a table:
> CREATE TABLE `itmp8888`(`id` INT, `name` STRING)
>  ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
>  WITH SERDEPROPERTIES (
>  'serialization.format' = '1'
>  )
>  STORED AS
>  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
>  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
>  TBLPROPERTIES (
>  'transient_lastDdlTime' = '1631174384',
>  'orc.encrypt' = 'AES_CTR_128:id,name',
>  'orc.mask' = 'sha256:id,name',
>  'orc.encrypt.ezk' = 'jNCeDBtNfT8wPaTpR34JHA=='
>  )
> 2. insert data
> 3. a select statement with no filters works fine:
>    select * from itmp8888
> 4. a select statement with a filter on the encrypted column throws an 
> exception:
>   select * from itmp8888 where id = 1
>  
> 5. the stack trace:
> Caused by: java.lang.AssertionError: Index is not populated for 1
>  at 
> org.apache.orc.impl.RecordReaderImpl$SargApplier.pickRowGroups(RecordReaderImpl.java:995)
>  at 
> org.apache.orc.impl.RecordReaderImpl.pickRowGroups(RecordReaderImpl.java:1083)
>  at 
> org.apache.orc.impl.RecordReaderImpl.readStripe(RecordReaderImpl.java:1101) 
> at 
> org.apache.orc.impl.RecordReaderImpl.advanceStripe(RecordReaderImpl.java:1151)
>  at 
> org.apache.orc.impl.RecordReaderImpl.advanceToNextRow(RecordReaderImpl.java:1186)
>  at org.apache.orc.impl.RecordReaderImpl.<init>(RecordReaderImpl.java:248) at 
> org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:864) at 
> org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initialize(OrcColumnarBatchReader.java:142)
>  at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$2.apply(OrcFileFormat.scala:211)
>  at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$2.apply(OrcFileFormat.scala:175)
>  at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:124)
>  at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
>  at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
> 6. Debugging the code, I found that the RowIndex is null for all the 
> encrypted columns
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
