JulianJaffePinterest commented on pull request #10920:
URL: https://github.com/apache/druid/pull/10920#issuecomment-993161134


   Calling `.partitionBy` on a `DataFrameWriter` (what you get when you call 
`.write()` on a DataFrame) doesn't do anything for a v2 data source that 
doesn't have a managed catalog, which Druid does not (see #11929 for a recent 
example). The 
[docs](https://github.com/apache/druid/blob/8392f87236d4a9795aa4e2867eea18cdf0aeb8ec/docs/operations/spark.md#writer)
 have a more in-depth discussion of partitioning, but the short version is that 
you'll either need to partition your dataframe before calling `.write()` on it 
or use one of the `DruidDataFrame` wrapper's convenience methods (for example,
   
   ```scala
   import org.apache.druid.spark.DruidDataFrame
   import org.apache.spark.sql.SaveMode
   
   df.partitionAndWrite("__time", "millis", "DAY", 200000)
     .format("druid")
     .mode(SaveMode.Overwrite)
     .options(map)
     .save()
   ```
   or in Java
   ```java
   import org.apache.druid.spark.package$.MODULE$.DruidDataFrame
   
   DruidDataFrame(dataset).partitionAndWrite("__time", "millis", "DAY", 200000)
       .format("druid")
       .mode(SaveMode.Overwrite)
       .options(map)
       .save();
   ```)
   
   If you don't want to use implicits/wrapper classes, you can also use the 
partitioner directly:
   ```java
   SingleDimensionPartitioner partitioner = new SingleDimensionPartitioner(dataset);
   Dataset<Row> partitionedDataset =
       partitioner.partition("__time", "millis", "DAY", 200000, "dim1", true);
   
   partitionedDataset.write().format("druid").mode(SaveMode.Overwrite).options(map).save();
   ```
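   
   To make the idea concrete, here is a rough, Spark-free sketch of what 
"partition by day with a row cap" means. It is not the connector's 
implementation. It assumes the arguments above mean "bucket the millisecond 
`__time` column by DAY and keep at most 200000 rows per partition", and 
`DayBucketSketch` and its `partition` method are hypothetical names used only 
for illustration:
   ```java
   import java.util.ArrayList;
   import java.util.List;
   import java.util.Map;
   import java.util.TreeMap;

   public class DayBucketSketch {
       static final long MILLIS_PER_DAY = 24L * 60 * 60 * 1000;

       // Groups epoch-millisecond timestamps by UTC day, then splits each day
       // into chunks of at most rowsPerPartition rows. Illustrative only; the
       // real partitioner works on Spark DataFrames, not in-memory lists.
       static List<List<Long>> partition(List<Long> timestamps, int rowsPerPartition) {
           Map<Long, List<Long>> byDay = new TreeMap<>();
           for (long ts : timestamps) {
               long day = Math.floorDiv(ts, MILLIS_PER_DAY);
               byDay.computeIfAbsent(day, d -> new ArrayList<>()).add(ts);
           }
           List<List<Long>> partitions = new ArrayList<>();
           for (List<Long> rows : byDay.values()) {
               for (int i = 0; i < rows.size(); i += rowsPerPartition) {
                   int end = Math.min(i + rowsPerPartition, rows.size());
                   partitions.add(new ArrayList<>(rows.subList(i, end)));
               }
           }
           return partitions;
       }

       public static void main(String[] args) {
           List<Long> ts = List.of(0L, 1000L, 2000L, MILLIS_PER_DAY + 5);
           // Three rows fall on day 0 and one on day 1; with a cap of 2 rows
           // this yields three partitions, none spanning more than one day.
           System.out.println(partition(ts, 2));
       }
   }
   ```
   The point is that every output partition covers exactly one time bucket, 
so each Spark partition can be written out as a segment for a single interval.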
   
   Also, are you setting `writer.version` in your options map? I'm surprised to 
see the segments differ in version between each partition. That's what's 
causing the partitions to overshadow each other.
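   
   For anyone following along, here is a simplified model of why differing 
versions matter. Within the same interval, Druid only serves the segments 
with the newest version, and versions are compared as strings. The sketch 
below only handles segments with identical intervals (the real timeline logic 
also deals with partial overlaps), and `OvershadowSketch`, `Segment`, and 
`visible` are hypothetical names, not Druid classes:
   ```java
   import java.util.ArrayList;
   import java.util.HashMap;
   import java.util.List;
   import java.util.Map;

   public class OvershadowSketch {
       // Hypothetical, stripped-down stand-in for a segment descriptor.
       static final class Segment {
           final String interval;
           final String version;
           final int partitionNum;

           Segment(String interval, String version, int partitionNum) {
               this.interval = interval;
               this.version = version;
               this.partitionNum = partitionNum;
           }
       }

       // Keeps only the segments whose version is the highest seen for their
       // interval; everything else counts as overshadowed in this model.
       static List<Segment> visible(List<Segment> segments) {
           Map<String, String> newest = new HashMap<>();
           for (Segment s : segments) {
               newest.merge(s.interval, s.version, (a, b) -> a.compareTo(b) >= 0 ? a : b);
           }
           List<Segment> result = new ArrayList<>();
           for (Segment s : segments) {
               if (s.version.equals(newest.get(s.interval))) {
                   result.add(s);
               }
           }
           return result;
       }

       public static void main(String[] args) {
           String day = "2021-12-01/2021-12-02";
           // Each partition wrote its own version: only the newest survives.
           List<Segment> differingVersions = List.of(
               new Segment(day, "2021-12-14T01:00:00.000Z", 0),
               new Segment(day, "2021-12-14T01:00:05.000Z", 1));
           // All partitions share one version: both stay visible.
           List<Segment> sharedVersion = List.of(
               new Segment(day, "2021-12-14T01:00:00.000Z", 0),
               new Segment(day, "2021-12-14T01:00:00.000Z", 1));
           System.out.println(visible(differingVersions).size()); // prints 1
           System.out.println(visible(sharedVersion).size());     // prints 2
       }
   }
   ```
   If that matches what you're seeing, making sure every partition writes 
with the same version should keep all of them visible.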


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


