vikramsinghchandel opened a new issue #10210:
URL: https://github.com/apache/druid/issues/10210


   For the same data load, Druid native ingestion performed roughly 3x worse than EMR/Hadoop-based indexing in terms of roll-up ratio and resulting data size.
   
   ### 0.18.0 & 0.18.1
   Testing was done on both versions; the numbers below are provided only for v0.18.1.
   
   ### Description
   
   Ingestion data: 1 hour of data, 300 `json.gz` files, each ~136 MB in size
   Cluster details: see below

   **EMR numbers:**

   | Total Segments | Total Data Size (17-07-2020, Hour 0) | Total Rows | Avg Rows / Segment | Roll-Up Ratio |
   |---|---|---|---|---|
   | 18 | 7.2 GB | 81M | 5M | 123 |

   **Native ingestion numbers:**

   | Total Segments | Total Data Size (17-07-2020, Hour 0) | Total Rows | Avg Rows / Segment | Roll-Up Ratio |
   |---|---|---|---|---|
   | 50 | 16.29 GB | 244M | 4.8M | 40 |
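   As a sanity check on the tables above (a sketch using only the reported numbers): roll-up ratio is input rows divided by stored rows, so both runs imply a similar input volume, while the native run stores roughly 3x the rows but "only" ~2.3x the bytes:

   ```python
   # Numbers taken from the two tables above.
   emr_stored, emr_ratio = 81e6, 123
   native_stored, native_ratio = 244e6, 40

   emr_input = emr_stored * emr_ratio           # ~9.96e9 implied input rows
   native_input = native_stored * native_ratio  # ~9.76e9 implied input rows

   row_blowup = native_stored / emr_stored      # ~3.0x stored rows
   size_blowup = 16.29 / 7.2                    # ~2.26x on-disk size

   print(f"implied input rows: {emr_input:.2e} vs {native_input:.2e}")
   print(f"stored rows x{row_blowup:.2f}, size x{size_blowup:.2f}")
   ```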
   
   - Cluster size:
     - EMR: 22 × m5.2xl nodes = 176 vCPU
     - Native (K8s indexer nodes): 50 nodes, each with 4 CPU and 26 GB memory (200 vCPU)
       - Each node runs 3 workers, so a total of 150 workers are spawned for native ingestion.
   - Configurations & spec in use:
   
   **EMR-Hadoop Ingestion spec:**
   
   ```json
   {
       "type":"index_hadoop",
       "spec":
       {
         "dataSchema": {
           "dataSource": "<ds name>",
           "parser": {
             "type": "hadoopyString",
             "parseSpec": {
               "format": "json",
               "timestampSpec": {
                 "column": "requestTime",
                 "format": "millis"
               },
               "dimensionsSpec": {
                 "dimensions": [<dimensions>],
                 "dimensionExclusions": [],
                 "spatialDimensions": []
               }
             }
           },
           "metricsSpec": [<metrics>],
           "granularitySpec": {
             "type": "uniform",
             "segmentGranularity": "HOUR",
             "queryGranularity": "HOUR",
             "intervals" : [ 
"2020-07-22T00:00:00.000Z/2020-07-22T01:00:00.000Z" ]
           }
         },
         "ioConfig": {
           "type": "hadoop",
           "inputSpec": {
              "type": "static",
              "paths": "<path>"
           }
         },
         "tuningConfig" : {
               "type" : "hadoop",
               "partitionsSpec" : {
                 "type" : "hashed",
                 "partitionDimension" : null,
                 "maxRowsPerSegment" : 5000000,
                 "assumeGrouped" : false,
                 "numShards" : -1
               },
               "shardSpecs" : { },
               "indexSpec" : {
                 "bitmap" : {
                   "type" : "roaring"
                 },
                 "dimensionCompression" : "lz4",
                 "metricCompression" : "lz4"
               },
               "leaveIntermediate" : false,
               "cleanupOnFailure" : true,
               "overwriteFiles" : true,
               "maxParseExceptions" : 1000,
               "jobProperties" : { },
               "combineText" : false,
               "aggregationBufferRatio" : 0.5,
               "rowFlushBoundary" : 300000,
               "useCombiner" : true,
               "numBackgroundPersistThreads" : 1
         }
       }
   }
   ```
   **Native Batch Spec:**
   
    ```json
   {
     "type": "index_parallel",
     "spec": {
       "type": "index_parallel",
       "dataSchema": {
         "dataSource": "<ds name>",
         "timestampSpec": {
           "column": "requestTime",
           "format": "millis"
         },
         "dimensionsSpec": {
           "dimensions": [<dimensions>]
         },
         "metricsSpec": [<metrics>],
         "granularitySpec": {
           "type": "uniform",
           "segmentGranularity": "HOUR",
           "queryGranularity": "HOUR",
           "rollup": true,
           "intervals": [
             "2020-07-21/2020-07-23"
           ]
         }
       },
       "ioConfig": {
         "type": "index_parallel",
         "inputSource": {
           "type": "s3",
           "prefixes": [
             "<path>"
           ]
         },
         "inputFormat": {
           "type": "json"
         }
       },
       "tuningConfig": {
         "type": "index_parallel",
         "partitionsSpec": {
           "type": "hashed",
           "numShards": 50
         },
         "forceGuaranteedRollup": true,
         "totalNumMergeTasks": 100,
         "maxNumSegmentsToMerge": 100,
         "maxNumConcurrentSubTasks": 149,
         "maxRowsInMemory": 4000000,
         "maxPendingPersists": 2,
         "indexSpec": {
           "bitmap": {
             "type": "roaring"
           },
           "dimensionCompression": "lz4",
           "metricCompression": "lz4"
         }
       }
     }
   }
   ```
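With `forceGuaranteedRollup: true` and a `hashed` partitionsSpec, rows are assigned to shards by hashing their dimension values, so each hourly interval should produce exactly `numShards` (= 50) segments, which matches the 50 segments in the table above. A minimal sketch of that bucketing (using `md5` as a stand-in for Druid's actual hash function, which this sketch does not reproduce):

```python
import hashlib

NUM_SHARDS = 50  # matches partitionsSpec.numShards in the spec above

def shard_for(dim_values):
    """Simplified stand-in for Druid's hashed partitioning: rows with the
    same dimension tuple always land in the same shard, so rows that would
    roll up together end up in the same segment."""
    key = "|".join(map(str, dim_values)).encode()
    return int(hashlib.md5(key).hexdigest(), 16) % NUM_SHARDS

# Identical dimension tuples always map to the same shard.
assert shard_for(["us", "chrome"]) == shard_for(["us", "chrome"])
assert 0 <= shard_for(["in", "firefox"]) < NUM_SHARDS
```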
   
Native Indexer runtime confs:

```properties
# indexer configs
druid.worker.version=0
druid.worker.capacity=3
druid.worker.numConcurrentMerges=2

# Peon processing configs
druid.processing.numThreads=3
druid.processing.numMergeBuffers=2
druid.peon.defaultSegmentWriteOutMediumFactory.type=offHeapMemory
```
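For context on the `maxNumSegmentsToMerge`/`maxNumConcurrentSubTasks: 149` choice in the spec: with 50 indexer nodes at `druid.worker.capacity=3` there are 150 task slots, and the `index_parallel` supervisor task itself presumably occupies one of them, leaving 149 for subtasks. A quick sketch of that arithmetic:

```python
nodes = 50
capacity_per_node = 3                 # druid.worker.capacity above
total_slots = nodes * capacity_per_node
supervisor_slots = 1                  # assumed: the supervisor task takes a slot
max_subtasks = total_slots - supervisor_slots
print(total_slots, max_subtasks)      # 150 149
```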
   
   **TL;DR:** Because of the lower roll-up ratio, native ingestion stores roughly 3x as many rows (244M vs 81M), and the on-disk data size more than doubles (16.29 GB vs 7.2 GB).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


