liucao-dd commented on PR #213:
URL: 
https://github.com/apache/cassandra-analytics/pull/213#issuecomment-4584108377

   Good point. I checked Spark's `Partitioning` contract and several maintained 
Spark V2 connectors (Iceberg, Paimon, ClickHouse, StarRocks, YDB, Lance). The 
common pattern is to instantiate `UnknownPartitioning` directly when the scan 
cannot guarantee keyed grouping, and reserve `KeyGroupedPartitioning` for cases 
where the connector can prove rows are grouped by the reported key expressions.
   
   That applies here: Cassandra analytics input partitions are token ranges, so 
a single Spark partition can contain many Cassandra partition keys/token 
values. We should not claim `KeyGroupedPartitioning`, and a Cassandra-specific 
subclass adds nothing over Spark's public `UnknownPartitioning`.
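
   To illustrate why a token-range partition cannot be reported as key-grouped, here is a minimal toy model. It is only a sketch: `toyToken`, the 0..99 token space, and the equal-width ranges are hypothetical stand-ins, not Cassandra's actual partitioner or the project's range-splitting logic.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class TokenRangeDemo {
    // Hypothetical toy token space [0, 100); a real cluster uses a far larger one.
    static final int TOKEN_SPACE = 100;

    // Stand-in for a partitioner: maps a partition key to a token.
    static int toyToken(String partitionKey) {
        return Math.floorMod(partitionKey.hashCode(), TOKEN_SPACE);
    }

    // Index of the equal-width token range (one per Spark partition) that owns a token.
    static int rangeIndex(int token, int rangeCount) {
        return token * rangeCount / TOKEN_SPACE;
    }

    // Groups partition keys by the token range they fall into.
    static Map<Integer, List<String>> groupByRange(List<String> keys, int rangeCount) {
        Map<Integer, List<String>> byRange = new TreeMap<>();
        for (String key : keys) {
            int range = rangeIndex(toyToken(key), rangeCount);
            byRange.computeIfAbsent(range, i -> new ArrayList<>()).add(key);
        }
        return byRange;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("user-1", "user-2", "user-3", "user-4", "user-5");
        // With fewer ranges than keys, at least one range must hold several distinct keys.
        groupByRange(keys, 2).forEach((range, ks) ->
                System.out.println("range " + range + " -> " + ks));
    }
}
```

   Because one range holds several unrelated keys, the scan has no key expression it could truthfully report for that partition.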
   
   Updated: I removed `CassandraPartitioning`, and 
`CassandraScanBuilder.outputPartitioning()` now returns `new 
UnknownPartitioning(dataLayer.partitionCount())` directly. The unit test still 
asserts that the reported partitioning is an `UnknownPartitioning` with the 
correct partition count.
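
   For readers without the diff open, the shape of the change is roughly the following. This is a hedged sketch, not the PR's code: `Partitioning` and `UnknownPartitioning` below are local stand-ins that mirror Spark's types in `org.apache.spark.sql.connector.read.partitioning` so the snippet compiles without Spark, and `DataLayer` and `ScanSketch` are hypothetical simplifications of the project's classes.

```java
// Local stand-in mirroring Spark's Partitioning interface (assumption: only
// numPartitions() matters for this sketch).
interface Partitioning {
    int numPartitions();
}

// Local stand-in mirroring Spark's public UnknownPartitioning class.
final class UnknownPartitioning implements Partitioning {
    private final int numPartitions;

    UnknownPartitioning(int numPartitions) {
        this.numPartitions = numPartitions;
    }

    @Override
    public int numPartitions() {
        return numPartitions;
    }
}

// Hypothetical simplification of the project's data layer.
interface DataLayer {
    int partitionCount();
}

// Hypothetical simplification of the scan: it reports only a partition count
// and makes no claim about how rows are grouped.
final class ScanSketch {
    private final DataLayer dataLayer;

    ScanSketch(DataLayer dataLayer) {
        this.dataLayer = dataLayer;
    }

    Partitioning outputPartitioning() {
        return new UnknownPartitioning(dataLayer.partitionCount());
    }
}
```

   A test in the same spirit would construct the scan with a known partition count and assert both the returned type and `numPartitions()`.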


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

