hildo opened a new issue, #14678:
URL: https://github.com/apache/iceberg/issues/14678

   ### Query engine
   
   Java API.  I can also reproduce this error when selecting * in Spark SQL.
   
   ### Question
   
   I am experiencing an issue when reading from Iceberg tables obtained from a Catalog. 
This is all using the Java API, version 1.10.0.
   
   Table Details
   - 3 columns: an integer column named identifier, a TimestampWithZone named 
eventTime, and a String named eventName
   - the table is partitioned using day() of the eventTime column
   - the identifier column is used as the equality field id
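
   For reference, the schema and spec described above can be sketched with the Java API like this (the field ids and the required/optional choices are assumptions; only the column names, types, and day partitioning come from the description above):

   ```java
   import org.apache.iceberg.PartitionSpec;
   import org.apache.iceberg.Schema;
   import org.apache.iceberg.types.Types;

   public class TableSetupSketch {
       // Schema as described: identifier (int), eventTime (timestamptz), eventName (string)
       static final Schema SCHEMA = new Schema(
           Types.NestedField.required(1, "identifier", Types.IntegerType.get()),
           Types.NestedField.required(2, "eventTime", Types.TimestampType.withZone()),
           Types.NestedField.optional(3, "eventName", Types.StringType.get()));

       // Partitioned by day(eventTime), as used for the DataFiles
       static PartitionSpec daySpec() {
           return PartitionSpec.builderFor(SCHEMA).day("eventTime").build();
       }

       public static void main(String[] args) {
           System.out.println(daySpec().isPartitioned()); // true
       }
   }
   ```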
   
   I am able to write Parquet files against this table using the 
PartitionedFanoutWriter, and that all works fine.  The intention is that 
rows are identified by the value in "identifier"; eventTime is the 
timestamp when the data was ingested.
   
   I also need to handle deleted records.  Deletions are usually ingested days 
after the records they target, so it cannot be assumed that a deletion lands in 
the same partition as the original ingestion.  For this reason I am attempting 
an equality deletion that is global; according to the spec for scan-planning, 
global DeleteFiles should be applied to all partitions.
   
   So, I have created global equality DeleteFiles, ensuring each file is written 
using PartitionSpec.unpartitioned().  I can see them defined against the table.
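
   For context, the delete files are produced roughly like this (a sketch, not my exact code; the output-file handling and the use of GenericAppenderFactory are assumptions, but the key point is that the factory and writer are given PartitionSpec.unpartitioned() and a null partition):

   ```java
   import org.apache.iceberg.FileFormat;
   import org.apache.iceberg.PartitionSpec;
   import org.apache.iceberg.Schema;
   import org.apache.iceberg.Table;
   import org.apache.iceberg.data.GenericAppenderFactory;
   import org.apache.iceberg.data.GenericRecord;
   import org.apache.iceberg.data.Record;
   import org.apache.iceberg.deletes.EqualityDeleteWriter;
   import org.apache.iceberg.io.OutputFileFactory;

   public class GlobalEqualityDeleteSketch {
       static void writeGlobalDelete(Table table, int idToDelete) throws Exception {
           // Delete rows carry only the equality column, identifier
           Schema deleteSchema = table.schema().select("identifier");
           int[] equalityFieldIds = { table.schema().findField("identifier").fieldId() };

           // Unpartitioned spec: the intent is a global delete
           GenericAppenderFactory appenderFactory = new GenericAppenderFactory(
               table.schema(), PartitionSpec.unpartitioned(),
               equalityFieldIds, deleteSchema, null);

           OutputFileFactory fileFactory = OutputFileFactory.builderFor(table, 1, 1)
               .format(FileFormat.PARQUET)
               .build();

           // null partition, matching the unpartitioned spec
           EqualityDeleteWriter<Record> writer = appenderFactory.newEqDeleteWriter(
               fileFactory.newOutputFile(), FileFormat.PARQUET, null);
           try (EqualityDeleteWriter<Record> w = writer) {
               Record r = GenericRecord.create(deleteSchema);
               r.setField("identifier", idToDelete);
               w.write(r);
           }

           // Commit the equality delete against the table
           table.newRowDelta().addDeletes(writer.toDeleteFile()).commit();
       }
   }
   ```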
   
   However, the scan does not filter out the deleted row.  I have been stepping 
through the code, and I can see where each DeleteFile is tested to decide 
whether it should be applied against a DataFile.  This happens in 
org.apache.iceberg.DeleteFileIndex.Builder.add (line 531).
   
   The code
   - looks up the specId recorded against the DeleteFile (in this case, 0)
   - looks up the PartitionSpec matching that id in a local map
   - checks whether that PartitionSpec is partitioned; if the spec says 
unpartitioned, the delete is applied globally
   
   What I am seeing is that the delete is always being treated as unpartitioned.  
This is because the local map of PartitionSpecs is always sourced from the 
Table, which only has a single PartitionSpec: the one used when writing 
and reading DataFiles.
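
   To make the behaviour concrete, the check reduces to something like this dependency-free sketch (a paraphrase of the logic as I understand it, not Iceberg's actual code):

   ```java
   import java.util.Map;

   public class DeleteScopeSketch {
       // Paraphrase of the check in DeleteFileIndex.Builder.add: resolve the
       // DeleteFile's specId against the table's specs, then treat the delete
       // as global only when the resolved spec is unpartitioned.
       static boolean appliesGlobally(int deleteFileSpecId, Map<Integer, Boolean> specIdToPartitioned) {
           boolean partitioned = specIdToPartitioned.get(deleteFileSpecId);
           return !partitioned;
       }

       public static void main(String[] args) {
           // Hypothetical spec map for illustration: spec 0 is partitioned,
           // spec 1 is unpartitioned.
           Map<Integer, Boolean> specs = Map.of(0, true, 1, false);

           System.out.println(appliesGlobally(0, specs)); // false: partition values must match
           System.out.println(appliesGlobally(1, specs)); // true: applied to all partitions
       }
   }
   ```

   The decision depends entirely on which PartitionSpec the table resolves for the delete file's specId, which is why the contents of that local map matter.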
   
   Am I doing something wrong?  Should the table have two specifications (one 
unpartitioned and one with the day partitioning)?  If so, I don't see how to do 
that.  Is there any advice on how I should define my table so that it supports 
reading and writing DataFiles with partitions while also allowing global DeleteFiles?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

