[jira] [Updated] (IMPALA-9515) Milestone 3: Reading “original files”

Jira Mon, 16 Mar 2020 07:58:27 -0700


[ https://issues.apache.org/jira/browse/IMPALA-9515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Zoltán Borók-Nagy updated IMPALA-9515:
--------------------------------------
    Description: 
“Original files” don’t store the special ACID columns, which means we need to 
auto-generate those values. In fact, we only need to auto-generate the record 
ID: (originalTransaction, bucket, rowId).
 * originalTransaction: can be parsed from the containing directory
 ** If it’s the table root directory then originalTransaction is 0

 * Bucket: it’s the bit-packed value of (bucket codec version, bucket id, and 
statement id)
 ** Bucket codec version is 1
 ** Bucket id can be parsed from the filename
 ** Statement id can be parsed from the delta directory:
 *** delta_<min_writeid>_<max_writeid>_<statement_id>
 *** (min_writeid = max_writeid for original files)

 * rowId: zero-based within each bucket. If there are multiple files in a 
single bucket:
 ** List all the files belonging to the bucket
 ** First file’s first row id is 0
 ** Next file’s first row id is the row count of the first file
 ** And so on
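
A minimal sketch of the directory, filename, and bucket rules above, in Python for brevity. The bit layout is an assumption based on Hive's BucketCodec V1 (3 bits of codec version, 1 reserved bit, 12 bits of bucket id, 4 reserved bits, 12 bits of statement id) and is not spelled out in this issue. The filename pattern (e.g. 000001_0), the default statement id of 0 when the directory name has none, and all function names are hypothetical.

```python
import re

# Assumed layout (Hive BucketCodec V1), from the most significant bit down:
# 3 bits version | 1 reserved | 12 bits bucket id | 4 reserved | 12 bits statement id
CODEC_VERSION = 1
MAX_12_BITS = (1 << 12) - 1

# delta_<min_writeid>_<max_writeid>[_<statement_id>]
DELTA_RE = re.compile(r"^delta_(\d+)_(\d+)(?:_(\d+))?$")
# Assumed original-file naming: <bucket_id>_<suffix>, e.g. 000001_0 or 000001_0_copy_1
BUCKET_FILE_RE = re.compile(r"^(\d+)_\d+")


def parse_directory(dir_name):
    """Return (original_transaction, statement_id) for the containing directory."""
    m = DELTA_RE.match(dir_name)
    if m is None:
        # Not a delta directory: treat it as the table root, so originalTransaction is 0.
        return 0, 0
    min_write_id, max_write_id = int(m.group(1)), int(m.group(2))
    if min_write_id != max_write_id:
        raise ValueError("expected min_writeid == max_writeid for original files")
    # Assumption: a missing statement id is treated as 0.
    statement_id = int(m.group(3)) if m.group(3) is not None else 0
    return min_write_id, statement_id


def parse_bucket_id(file_name):
    """Extract the bucket id from the leading digits of the file name."""
    m = BUCKET_FILE_RE.match(file_name)
    if m is None:
        raise ValueError("unrecognized file name: " + file_name)
    return int(m.group(1))


def encode_bucket(bucket_id, statement_id):
    """Bit-pack (codec version, bucket id, statement id) into one 32-bit value."""
    if not 0 <= bucket_id <= MAX_12_BITS:
        raise ValueError("bucket id out of range")
    if not 0 <= statement_id <= MAX_12_BITS:
        raise ValueError("statement id out of range")
    return (CODEC_VERSION << 29) | (bucket_id << 16) | statement_id
```

Under these assumptions, a file 000001_0 in the table root would get originalTransaction 0 and a bucket value of encode_bucket(1, 0).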

The frontend should generate the base record ID for each file and propagate 
that information to the scanners. That way the scanners know whether they are 
scanning files in full ACID format or raw format. The ORC scanner needs to be 
changed to generate and fill in the ACID columns for original files.
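
As an illustration of the per-file base row id that the frontend would compute, here is a small Python sketch. It assumes the row count of every file is already known and that files within a bucket are ordered lexicographically by name; the issue does not specify the ordering, and the function name and input shape are hypothetical.

```python
from collections import defaultdict


def assign_first_row_ids(files):
    """Map each file name to the row id of its first row.

    files: iterable of (file_name, bucket_id, row_count) tuples.
    Row ids are zero-based per bucket and continue across the bucket's files.
    """
    by_bucket = defaultdict(list)
    for name, bucket_id, row_count in files:
        by_bucket[bucket_id].append((name, row_count))

    first_row_ids = {}
    for bucket_files in by_bucket.values():
        next_row_id = 0
        # Assumption: files of a bucket are processed in lexicographic name order.
        for name, row_count in sorted(bucket_files):
            first_row_ids[name] = next_row_id
            next_row_id += row_count
    return first_row_ids


example = assign_first_row_ids([
    ("000000_0", 0, 10),
    ("000000_0_copy_1", 0, 5),
    ("000001_0", 1, 7),
    ("000000_0_copy_2", 0, 3),
])
```

A scanner that receives the first row id of its file could then number rows as first row id plus the row's position within the file.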

  was:
“Original files” don’t store special ACID columns, that means we need to 
auto-generate those values. Actually we only need to auto-generate the record 
id: (originalTransaction, bucket, rowId).
 * originalTransaction: can be parsed from the containing directory
 * If it’s the table root directory then originalTransaction is 0


 * Bucket: it’s the bit-packed value of (bucket codec version, bucket id, and 
statement id)
 * Bucket codec version is 1
 * Bucket id can be parsed from the filename
 * Statement id can be parsed from the delta directory:
 * delta_<min_writeid>_<max_writeid>_<statement_id>
 * (min_writeid = max_writeid for original files)


 * rowId: zero-based for each bucket, if there are multiple files in a single 
bucket:
 * List all the files belonging to the bucket
 * First file’s first row id is 0
 * Next file’s first row id is the row count of the first file
 * And so on

The frontend should generate the base record ID for each file and propagate 
that information to the scanners. Therefore the scanners would know if they are 
scanning files in full ACID format or raw format. The ORC scanner needs to be 
changed in order to generate and fill the ACID columns for original files.


> Milestone 3: Reading “original files”
> -------------------------------------
>
>                 Key: IMPALA-9515
>                 URL: https://issues.apache.org/jira/browse/IMPALA-9515
>             Project: IMPALA
>          Issue Type: Sub-task
>            Reporter: Zoltán Borók-Nagy
>            Priority: Major
>              Labels: impala-acid
>
> “Original files” don’t store the special ACID columns, which means we need to 
> auto-generate those values. In fact, we only need to auto-generate the record 
> ID: (originalTransaction, bucket, rowId).
>  * originalTransaction: can be parsed from the containing directory
>  ** If it’s the table root directory then originalTransaction is 0
>  * Bucket: it’s the bit-packed value of (bucket codec version, bucket id, and 
> statement id)
>  ** Bucket codec version is 1
>  ** Bucket id can be parsed from the filename
>  ** Statement id can be parsed from the delta directory:
>  *** delta_<min_writeid>_<max_writeid>_<statement_id>
>  *** (min_writeid = max_writeid for original files)
>  * rowId: zero-based within each bucket. If there are multiple files in a 
> single bucket:
>  ** List all the files belonging to the bucket
>  ** First file’s first row id is 0
>  ** Next file’s first row id is the row count of the first file
>  ** And so on
> The frontend should generate the base record ID for each file and propagate 
> that information to the scanners. That way the scanners know whether they 
> are scanning files in full ACID format or raw format. The ORC scanner needs 
> to be changed to generate and fill in the ACID columns for original 
> files.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

