wangxiaobaidu11 edited a comment on pull request #10920:
URL: https://github.com/apache/druid/pull/10920#issuecomment-990912600


   > @wangxiaobaidu11 you don't need to make changes to the druid spark code for your use case - you can call `AggregatorFactoryRegistry.register("longUnique", new LongUniqueAggregatorFactory("", "", 0))` from within your own spark app. That's definitely still ugly since the AggregatorFactory instance is unnecessary, but as mentioned in my previous comment this won't be the case for long. If instantiating an instance is a problem, there is one other temporary work-around: because all `AggregatorFactoryRegistry` does under the hood is register subtypes, you can use the public package method `registerSubtype`. In your case, you would call `org.apache.druid.spark.registerSubtype(new NamedType(classOf[LongUniqueAggregatorFactory], "longUnique"))` from your spark app. (You can statically import that method if you'd like, leaving just `registerSubtype(...)`.)
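   For reference, the registration described above might be sketched like this in a Spark job (a sketch only, assuming the connector's package-level `registerSubtype` helper and the user's own `LongUniqueAggregatorFactory` class; everything else here is illustrative):

   ```scala
   import com.fasterxml.jackson.databind.jsontype.NamedType
   import org.apache.druid.spark.registerSubtype

   object MyDruidWriterJob {
     def main(args: Array[String]): Unit = {
       // Register the custom aggregator factory subtype with the connector's
       // Jackson mapper BEFORE any Druid read or write is attempted.
       registerSubtype(new NamedType(classOf[LongUniqueAggregatorFactory], "longUnique"))

       // ... build the SparkSession and write the DataFrame to Druid as usual ...
     }
   }
   ```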
   
   Thanks! I will update it. I have another question.
   ① When I set:
   
![image](https://user-images.githubusercontent.com/24448732/145568671-80f11524-64b8-4185-89ab-5bfc68b6a5bb.png)
   ② Spark runtime info:
   `21/12/10 16:09:45 INFO DruidDataSourceWriter: Committing the following 
segments: DataSegment{binaryVersion=9, 
id=test_spark_druid_cube_v4_2020-01-01T00:00:00.000Z_2020-01-02T00:00:00.000Z_2021-12-10T08:09:41.601Z,
 loadSpec={type=>hdfs, 
path=>hdfs://xxxx/xxxx/xxx/xxxxxx/segments/test_spark_druid_cube_v4/20200101T000000.000Z_20200102T000000.000Z/2021-12-10T08_09_41.601Z/3_5427a1c2-6405-4516-83b1-2dd17bfff433_index.zip},
 dimensions=[dim1, dim2, id1, id2], metrics=[count, sum_metric1, sum_metric2, 
sum_metric3, sum_metric4, uniq_id1_unique], 
shardSpec=NumberedShardSpec{partitionNum=0, partitions=1}, 
lastCompactionState=null, size=3390}, DataSegment{binaryVersion=9, 
id=test_spark_druid_cube_v4_2020-01-01T00:00:00.000Z_2020-01-02T00:00:00.000Z_2021-12-10T08:09:41.443Z,
 loadSpec={type=>hdfs, 
path=>hdfs://xxxx/xxxx/xxx/xxxxxx/segments/test_spark_druid_cube_v4/20200101T000000.000Z_20200102T000000.000Z/2021-12-10T08_09_41.443Z/1_c5be6b7e-76a6-44dd-9c53-f189950cb54d_index.zip},
 dimensions=[dim1, dim2, id1, id2], metrics=[count, sum_metric1, sum_metric2, sum_metric3, 
sum_metric4, uniq_id1_unique], shardSpec=NumberedShardSpec{partitionNum=0, 
partitions=1}, lastCompactionState=null, size=3390}, 
DataSegment{binaryVersion=9, 
id=test_spark_druid_cube_v4_2020-01-01T00:00:00.000Z_2020-01-02T00:00:00.000Z_2021-12-10T08:09:41.767Z,
 loadSpec={type=>hdfs, 
path=>hdfs://xxxx/xxxx/xxx/xxxxxx/segments/test_spark_druid_cube_v4/20200101T000000.000Z_20200102T000000.000Z/2021-12-10T08_09_41.767Z/2_04fc7ed4-4131-4856-b60d-95c7b409251c_index.zip},
 dimensions=[dim1, dim2, id1, id2], metrics=[count, sum_metric1, sum_metric2, 
sum_metric3, sum_metric4, uniq_id1_unique], 
shardSpec=NumberedShardSpec{partitionNum=0, partitions=1}, 
lastCompactionState=null, size=3390}, DataSegment{binaryVersion=9, 
id=test_spark_druid_cube_v4_2020-01-01T00:00:00.000Z_2020-01-02T00:00:00.000Z_2021-12-10T08:09:41.336Z,
 loadSpec={type=>hdfs, 
path=>hdfs://xxxx/xxxx/xxx/xxxxxx/segments/test_spark_druid_cube_v4/20200101T000000.000Z_20200102T000000.000Z/2021-12-10T08_09_41.336Z/0_918c926a-5738-4a19-a58b-8a3024ee01ad_index.zip},
 dimensions=[dim1, dim2, id1, id2], metrics=[count, sum_metric1, sum_metric2, 
sum_metric3, sum_metric4, uniq_id1_unique], 
shardSpec=NumberedShardSpec{partitionNum=0, partitions=1}, 
lastCompactionState=null, size=3390}, DataSegment{binaryVersion=9, 
id=test_spark_druid_cube_v4_2020-01-02T00:00:00.000Z_2020-01-03T00:00:00.000Z_2021-12-10T08:09:41.299Z,
 loadSpec={type=>hdfs, 
path=>hdfs://xxxx/xxxx/xxx/xxxxxx/segments/test_spark_druid_cube_v4/20200102T000000.000Z_20200103T000000.000Z/2021-12-10T08_09_41.299Z/5_c62528cf-4377-4fa4-98da-df09dcc8e359_index.zip},
 dimensions=[dim1, dim2, id1, id2], metrics=[count, sum_metric1, sum_metric2, 
sum_metric3, sum_metric4, uniq_id1_unique], 
shardSpec=NumberedShardSpec{partitionNum=0, partitions=1}, 
lastCompactionState=null, size=3390}, DataSegment{binaryVersion=9, 
id=test_spark_druid_cube_v4_2020-01-02T00:00:00.000Z_2020-01-03T00:00:00.000Z_2021-12-10T08:09:41.835Z,
 loadSpec={type=>hdfs, 
path=>hdfs://xxxx/xxxx/xxx/xxxxxx/segments/test_spark_druid_cube_v4/20200102T000000.000Z_20200103T000000.000Z/2021-12-10T08_09_41.835Z/4_176e1242-cb5f-4745-bae3-0e9bb47b6c62_index.zip},
 dimensions=[dim1, dim2, id1, id2], metrics=[count, sum_metric1, sum_metric2, 
sum_metric3, sum_metric4, uniq_id1_unique], 
shardSpec=NumberedShardSpec{partitionNum=0, partitions=1}, 
lastCompactionState=null, size=3466}
   21/12/10 16:09:45 INFO SQLMetadataStorageUpdaterJobHandler: Published 
test_spark_druid_cube_v4_2020-01-01T00:00:00.000Z_2020-01-02T00:00:00.000Z_2021-12-10T08:09:41.601Z
   21/12/10 16:09:45 INFO SQLMetadataStorageUpdaterJobHandler: Published 
test_spark_druid_cube_v4_2020-01-01T00:00:00.000Z_2020-01-02T00:00:00.000Z_2021-12-10T08:09:41.443Z
   21/12/10 16:09:45 INFO SQLMetadataStorageUpdaterJobHandler: Published 
test_spark_druid_cube_v4_2020-01-01T00:00:00.000Z_2020-01-02T00:00:00.000Z_2021-12-10T08:09:41.767Z
   21/12/10 16:09:45 INFO SQLMetadataStorageUpdaterJobHandler: Published 
test_spark_druid_cube_v4_2020-01-01T00:00:00.000Z_2020-01-02T00:00:00.000Z_2021-12-10T08:09:41.336Z
   21/12/10 16:09:45 INFO SQLMetadataStorageUpdaterJobHandler: Published 
test_spark_druid_cube_v4_2020-01-02T00:00:00.000Z_2020-01-03T00:00:00.000Z_2021-12-10T08:09:41.299Z
   21/12/10 16:09:45 INFO SQLMetadataStorageUpdaterJobHandler: Published 
test_spark_druid_cube_v4_2020-01-02T00:00:00.000Z_2020-01-03T00:00:00.000Z_2021-12-10T08:09:41.835Z
   21/12/10 16:09:45 INFO WriteToDataSourceV2Exec: Data source writer 
org.apache.druid.spark.v2.writer.DruidDataSourceWriter@1d1c63af committed.`
   ③ Segments for the same date get covered (overshadowed), but I didn't want that to happen:
   `21/12/10 16:09:45 WARN SegmentRationalizer: More than one version detected 
for interval 2020-01-01T00:00:00.000Z/2020-01-02T00:00:00.000Z on dataSource 
test_spark_druid_cube_v4! Some segments will be overshadowed!
   21/12/10 16:09:45 WARN SegmentRationalizer: More than one version detected 
for interval 2020-01-02T00:00:00.000Z/2020-01-03T00:00:00.000Z on dataSource 
test_spark_druid_cube_v4! Some segments will be overshadowed!`
   
![image](https://user-images.githubusercontent.com/24448732/145570369-e7ec2c79-7173-4ac0-bdb6-37ce21f88a6e.png)
   ④ I expected the segments for each interval to be combined under a single version. How do I set the partitioning to achieve that?
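   Not an authoritative answer, but the warnings above suggest each writer task allocated its own segment version for the same interval. One common pattern is to repartition the DataFrame by the time bucket before writing, so all rows for a given day land in one Spark partition and the writer produces one version per interval. A sketch using only standard Spark APIs (the `__time` column name and the write invocation are assumptions about your job):

   ```scala
   import org.apache.spark.sql.functions.{col, date_trunc}

   // Bucket rows by day so each Druid interval is written from a single
   // Spark partition; otherwise each task may allocate its own version,
   // and segments of older versions are overshadowed.
   val bucketed = df
     .withColumn("__day", date_trunc("day", col("__time")))
     .repartition(col("__day"))
     .drop("__day")

   // bucketed.write.format("druid")... // then write via the connector as before
   ```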
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


