[GitHub] [iceberg] yyanyy edited a comment on pull request #1975: Core: add sort order id to content file

GitBox Tue, 26 Jan 2021 16:13:58 -0800


yyanyy edited a comment on pull request #1975:
URL: https://github.com/apache/iceberg/pull/1975#issuecomment-755053818



   > > Not sure if sort order should be nullable by default or 0 (from 
unsorted_order)
   > 
   > The field should be optional because v1 manifests will not have the order 
field. Iceberg will read the value as null, so I think it makes sense to use 
null. And you're right about not storing it for position deletes.
   > 
   > > Do we want only sort order id, or actual sort order struct?
   > 
   > We want the ID. Sort orders are attached to table metadata, so loading the 
order should be a simple hash map lookup.
   > 
   > > For the next PR, do we assume the table's current sort order id is the 
authoritative place to get sort order information when adding a new file?
   > 
   > No. Engines must specify which sort order was used to write a file 
explicitly. So this needs to be exposed in the DataFile and DeleteFile 
builders. By default, we should write either null or 0 (unordered). Probably 
null.
   
   Thank you for the response! 
   
   > We want the ID. Sort orders are attached to table metadata, so loading the 
order should be a simple hash map lookup.
   
   To do that, I think we may need to expose the sort order map (ID to sort 
order) through `FileScanTask`. Readers such as `RowDataReader` rely on the task 
for reading rows and don't seem to have the table available for a metadata 
lookup. Is that right?
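
   To make the idea concrete, here is a rough sketch of a task that carries the 
ID-to-order map alongside the file. All class and field names below are 
hypothetical stand-ins for illustration, not the actual Iceberg classes, and a 
null ID is treated as "no order recorded" (as for files from v1 manifests).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins for illustration only; not the Iceberg API.
final class SortOrderLookupSketch {

  // Stand-in for a sort order stored in table metadata.
  static final class Order {
    final int id;
    final String description;

    Order(int id, String description) {
      this.id = id;
      this.description = description;
    }
  }

  // Stand-in for a content file whose sort order id is optional.
  static final class FileEntry {
    final String path;
    final Integer sortOrderId; // null when the manifest has no order field

    FileEntry(String path, Integer sortOrderId) {
      this.path = path;
      this.sortOrderId = sortOrderId;
    }
  }

  // Stand-in for a scan task that carries the id -> order map with it,
  // so a reader can resolve the order without access to the table.
  static final class Task {
    final FileEntry file;
    final Map<Integer, Order> ordersById;

    Task(FileEntry file, Map<Integer, Order> ordersById) {
      this.file = file;
      this.ordersById = new HashMap<>(ordersById);
    }

    // Returns the file's sort order, or null if the id is missing or unknown.
    Order sortOrder() {
      Integer id = file.sortOrderId;
      if (id == null) {
        return null;
      }
      return ordersById.get(id);
    }
  }
}
```

   With something like this, the reader would only need the task itself to 
find out how a file was sorted.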
   
   > Engines must specify which sort order was used to write a file explicitly.
   
   (Sorry for the naive question.) I guess the sort order needs to be decided 
when building the writer, e.g. by adding a `sortOrder` parameter to the 
[`SparkWrite` writer 
factory](https://github.com/apache/iceberg/blob/master/spark3/src/main/java/org/apache/iceberg/spark/source/SparkWrite.java#L518). 
But how does the engine know which sort order to use when writing files? Maybe 
the sort order could be optionally specified when a job is created (e.g. as 
part of the SQL command for ingesting data), so the engine already knows which 
sort order to use when it creates the writer. Some validation against table 
metadata might be needed before that (e.g. check that the sort order exists, 
and create it or abort if not). If nothing is specified for the job/command, 
would the engine look up the table's default sort order and use it to create 
the writer?
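
   A minimal sketch of the resolution flow I have in mind, assuming the job may 
pass an optional sort order ID. The class, method, and parameter names are 
made up for illustration and are not existing Iceberg or Spark APIs. This 
version aborts on an unknown ID rather than creating a new order.

```java
import java.util.Map;

// Hypothetical helper for illustration only; not an Iceberg or Spark API.
final class WriteSortOrderSketch {

  private WriteSortOrderSketch() {
  }

  // requestedId: order id given by the job/command, or null if none was given.
  // tableOrders: sort orders known to the table metadata, keyed by id.
  // tableDefaultId: the table's default sort order id.
  static int resolve(Integer requestedId, Map<Integer, String> tableOrders, int tableDefaultId) {
    if (requestedId == null) {
      // Nothing specified for this job: fall back to the table default.
      return tableDefaultId;
    }
    if (!tableOrders.containsKey(requestedId)) {
      // Validation against table metadata failed: abort the write.
      throw new IllegalArgumentException("Unknown sort order id: " + requestedId);
    }
    return requestedId;
  }
}
```

   The resolved ID would then be handed to the writer when it is created, so 
each file can record the order it was actually written with.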


