[GitHub] [druid] dmarkhas opened a new issue #10687: Hadoop ingestion ignores useDefaultValueForNull=false for metrics with rollup enabled

GitBox Thu, 17 Dec 2020 04:58:26 -0800


dmarkhas opened a new issue #10687:
URL: https://github.com/apache/druid/issues/10687



   When index_hadoop is used to ingest a Parquet file with null values in a metric column, useDefaultValueForNull=false is not respected when rollup is enabled.
   For dimension values whose metric values are all null, the longSum aggregation is calculated as 0 instead of null.
   
   Ingesting the same file with index_parallel, either from local storage or natively from S3, results in the correct behaviour: the longSum aggregation is calculated as null.
   
   ### Affected Version
   
   0.19.0
   
   ### Description
   
   The Parquet file has three columns, name (string), age (int), and index (int), and was created with the following Spark code:
   
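   ```
   import org.apache.spark.sql.Row
   import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

   // Six rows; "index" is null for both "Dan" rows.
   val dfdata = Seq(Row("Dan", 35, null), Row("Dan", 34, null), Row("John", 20, 2),
     Row("Mike", 30, 9), Row("Sam", 40, 0), Row("Tom", 17, 1))

   // All three columns are nullable.
   val dfSchema = List(StructField("name", StringType, true),
     StructField("age", IntegerType, true), StructField("index", IntegerType, true))

   // Assumes an existing SparkSession named `spark` (e.g. in spark-shell).
   val df = spark.createDataFrame(spark.sparkContext.parallelize(dfdata),
     StructType(dfSchema))
   ```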
   
   As you can see, the "index" column is null for all records with name = 'Dan', so I would expect the longSum aggregation of "index" to be null for these rows.
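   
   To make the expectation concrete, here is a minimal sketch in plain Scala of the two null-handling semantics applied to the rows above. It is only a model, not Druid code: `NullAwareRollup`, `sqlCompatibleSum`, `defaultValueSum`, `expected` and `observed` are hypothetical names introduced for illustration.
   
   ```scala
   // Hypothetical model of a longSum rollup grouped by "name"; not Druid code.
   object NullAwareRollup {
     // (name, index) pairs mirroring the rows in the Spark snippet; None stands for null.
     val rows: Seq[(String, Option[Long])] = Seq(
       ("Dan", None), ("Dan", None), ("John", Some(2L)),
       ("Mike", Some(9L)), ("Sam", Some(0L)), ("Tom", Some(1L))
     )

     // SQL-style sum: nulls are skipped, and a group with no non-null input stays null.
     def sqlCompatibleSum(values: Seq[Option[Long]]): Option[Long] =
       values.flatten.reduceOption(_ + _)

     // Default-value sum: every null is replaced by 0 before summing.
     def defaultValueSum(values: Seq[Option[Long]]): Long =
       values.map(_.getOrElse(0L)).sum

     // Group the rows by name and apply the given aggregation to each group.
     def rollup[A](f: Seq[Option[Long]] => A): Map[String, A] =
       rows.groupBy(_._1).map { case (name, group) => name -> f(group.map(_._2)) }

     // What I would expect with useDefaultValueForNull=false.
     val expected: Map[String, Option[Long]] = rollup(vs => sqlCompatibleSum(vs))
     // What the 0 seen with index_hadoop corresponds to.
     val observed: Map[String, Long] = rollup(vs => defaultValueSum(vs))
   }
   ```
   
   In this model `expected("Dan")` is `None` while `observed("Dan")` is `0`. "Sam" has a real 0 in the data, so only the first variant keeps "Dan" (no data) distinguishable from "Sam" (zero).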
   
   The parquet file itself is attached.
   
   The ingestion spec used for ingestion via hadoop:
   
   ```
   {
     "type": "index_hadoop",
     "spec": {
       "ioConfig": {
         "type": "hadoop",
         "inputSpec": {
           "type": "static",
           "paths": "s3a://<BUCKET>/part-00000-5933233a-9db6-4bbb-8529-fa9687c9b2f1-c000.gz.parquet",
           "inputFormat": "org.apache.druid.data.input.parquet.DruidParquetInputFormat"
         }
       },
       "tuningConfig": {
         "type": "hadoop"
       },
       "dataSchema": {
         "dataSource": "parquet_test_hadoop",
         "parser": {
           "type": "parquet",
           "parseSpec": {
             "format": "parquet",
             "dimensionsSpec": {
               "dimensions": ["name"],
               "dimensionExclusions": []
             },
             "timestampSpec": {
               "missingValue": "2010-01-01T00:00:00.000Z",
               "format": "auto",
               "column": "timestamp"
             }
           }
         },
         "metricsSpec": [
           {"name": "sum_idx", "type": "longSum", "fieldName": "index"}
         ],
         "granularitySpec": {
           "type": "uniform",
           "queryGranularity": "DAY",
           "rollup": true,
           "segmentGranularity": "DAY"
         }
       }
     }
   }
   ```
   
   The ingestion spec used to ingest natively from S3:
   
   ```
   {
     "type": "index_parallel",
     "spec": {
       "dataSchema": {
         "dataSource": "parquet_test_s3",
         "timestampSpec": {
           "column": "timestamp",
           "format": "auto",
           "missingValue": "2010-01-01T00:00:00.000Z"
         },
         "dimensionsSpec": {
           "dimensions": ["name"]
         },
         "metricsSpec": [
           {
             "type": "longSum",
             "name": "sum_idx",
             "fieldName": "index"
           }
         ],
         "granularitySpec": {
           "type": "uniform",
           "segmentGranularity": "DAY",
           "queryGranularity": "DAY",
           "rollup": true
         }
       },
       "ioConfig": {
         "type": "index_parallel",
         "inputSource": {
           "type": "s3",
           "uris": [
             "s3://<BUCKET>/part-00000-5933233a-9db6-4bbb-8529-fa9687c9b2f1-c000.gz.parquet"
           ],
           "properties": {
             "accessKeyId": {
               "type": "default",
               "password": "<ACCESS_KEY>"
             },
             "secretAccessKey": {
               "type": "default",
               "password": "<SECRET_KEY>"
             }
           }
         },
         "inputFormat": {
           "type": "parquet"
         }
       },
       "tuningConfig": {
         "type": "index_parallel"
       }
     },
     "dataSource": "parquet_test_s3"
   }
   ```
   
   useDefaultValueForNull is not respected for hadoop ingestion:
   
   
![image](https://user-images.githubusercontent.com/16191105/102490821-0a6b7600-4078-11eb-8e53-5c298651cfd8.png)
   
   useDefaultValueForNull is respected for native ingestion:
   
   
![image](https://user-images.githubusercontent.com/16191105/102490869-19eabf00-4078-11eb-826b-45c3aa675ab6.png)
   
[part-00000-5933233a-9db6-4bbb-8529-fa9687c9b2f1-c000.gz.parquet.gz](https://github.com/apache/druid/files/5709576/part-00000-5933233a-9db6-4bbb-8529-fa9687c9b2f1-c000.gz.parquet.gz)
   


