#general
@dwarvenkat: @dwarvenkat has joined the channel
@christil: @christil has joined the channel
@mayanks: Hello Pinot community, wondering if there might be interest in talks about your use cases with Pinot at ApacheCon:
@steotia: Hey Mayank, Uber and Li submitted a joint proposal
@mayanks: Thanks Sid, awesome. What's the talk title?
@steotia: Here is the
@mayanks: Nice, thanks!
@karinwolok1: Cool! Thanks!! Awesome @steotia!!!! :heart:
@karinwolok1: That sounds like an awesome presentation. I wonder if it would make sense to push it into a bunch of other conferences as well
@karinwolok1: and thank you for sharing @mayanks
@yupeng: hey, Pinot community, I want to share this Uber engineering blog (
@mayanks: Took the liberty of sharing this post on LinkedIn.
@chinmay.cerebro: Great job @ujwala.tulshigiri and @dharakkharod! It's an awesome blog.
@rkruze: @rkruze has joined the channel
@kevdesigned: @kevdesigned has joined the channel
#random
@dwarvenkat: @dwarvenkat has joined the channel
@christil: @christil has joined the channel
@rkruze: @rkruze has joined the channel
@kevdesigned: @kevdesigned has joined the channel
#troubleshooting
@syedakram93: Any update on this?
@fx19880617: what’s the issue?
@syedakram93:
@fx19880617: I tried with your setup and it works
@fx19880617: this is my ingestion config:
```
➜  cat examples/batch/jsontype/ingestionJobSpec.yaml
# executionFrameworkSpec: Defines ingestion jobs to be running.
executionFrameworkSpec:
  # name: execution framework name
  name: 'standalone'
  # segmentGenerationJobRunnerClassName: class name implements org.apache.pinot.spi.batch.ingestion.runner.SegmentGenerationJobRunner interface.
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  # segmentTarPushJobRunnerClassName: class name implements org.apache.pinot.spi.batch.ingestion.runner.SegmentTarPushJobRunner interface.
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
  # segmentUriPushJobRunnerClassName: class name implements org.apache.pinot.spi.batch.ingestion.runner.SegmentUriPushJobRunner interface.
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner'

# jobType: Pinot ingestion job type.
# Supported job types are:
#   'SegmentCreation'
#   'SegmentTarPush'
#   'SegmentUriPush'
#   'SegmentCreationAndTarPush'
#   'SegmentCreationAndUriPush'
jobType: SegmentCreationAndTarPush

# inputDirURI: Root directory of input data, expected to have scheme configured in PinotFS.
inputDirURI: 'examples/batch/jsontype/rawdata'

# includeFileNamePattern: include file name pattern, supported glob pattern.
# Sample usage:
#   'glob:*.avro' will include all avro files just under the inputDirURI, not sub directories;
#   'glob:**/*.avro' will include all the avro files under inputDirURI recursively.
includeFileNamePattern: 'glob:**/*.json'

# excludeFileNamePattern: exclude file name pattern, supported glob pattern.
# Sample usage:
#   'glob:*.avro' will exclude all avro files just under the inputDirURI, not sub directories;
#   'glob:**/*.avro' will exclude all the avro files under inputDirURI recursively.
# _excludeFileNamePattern: ''

# outputDirURI: Root directory of output segments, expected to have scheme configured in PinotFS.
outputDirURI: 'examples/batch/jsontype/segments'

# overwriteOutput: Overwrite output segments if existed.
overwriteOutput: true

# pinotFSSpecs: defines all related Pinot file systems.
pinotFSSpecs:
  -
    # scheme: used to identify a PinotFS.
    # E.g. local, hdfs, dbfs, etc
    scheme: file

    # className: Class name used to create the PinotFS instance.
    # E.g.
    #   org.apache.pinot.spi.filesystem.LocalPinotFS is used for local filesystem
    #   org.apache.pinot.plugin.filesystem.AzurePinotFS is used for Azure Data Lake
    #   org.apache.pinot.plugin.filesystem.HadoopPinotFS is used for HDFS
    className: org.apache.pinot.spi.filesystem.LocalPinotFS

# recordReaderSpec: defines all record reader
recordReaderSpec:
  # dataFormat: Record data format, e.g. 'avro', 'parquet', 'orc', 'csv', 'json', 'thrift' etc.
  dataFormat: 'json'

  # className: Corresponding RecordReader class name.
  # E.g.
  #   org.apache.pinot.plugin.inputformat.avro.AvroRecordReader
  #   org.apache.pinot.plugin.inputformat.csv.CSVRecordReader
  #   org.apache.pinot.plugin.inputformat.parquet.ParquetRecordReader
  #   org.apache.pinot.plugin.inputformat.json.JSONRecordReader
  #   org.apache.pinot.plugin.inputformat.orc.ORCRecordReader
  #   org.apache.pinot.plugin.inputformat.thrift.ThriftRecordReader
  className: 'org.apache.pinot.plugin.inputformat.json.JSONRecordReader'

  # configClassName: Corresponding RecordReaderConfig class name, it's mandatory for CSV and Thrift file format.
  # E.g.
  #   org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig
  #   org.apache.pinot.plugin.inputformat.thrift.ThriftRecordReaderConfig
  configClassName:

  # configs: Used to init RecordReaderConfig class name, this config is required for CSV and Thrift data format.
  configs:

# tableSpec: defines table name and where to fetch corresponding table config and table schema.
tableSpec:
  # tableName: Table name
  tableName: 'myTable'

  # schemaURI: defines where to read the table schema, supports PinotFS or HTTP.
  # E.g.
  #
```
@fx19880617: This is the job log:
```
➜  bin/pinot-admin.sh LaunchDataIngestionJob -jobSpecFile examples/batch/jsontype/ingestionJobSpec.yaml
SegmentGenerationJobSpec: !!org.apache.pinot.spi.ingestion.batch.spec.SegmentGenerationJobSpec
cleanUpOutputDir: false
excludeFileNamePattern: null
executionFrameworkSpec: {extraConfigs: null, name: standalone,
  segmentGenerationJobRunnerClassName: org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner,
  segmentMetadataPushJobRunnerClassName: null,
  segmentTarPushJobRunnerClassName: org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner,
  segmentUriPushJobRunnerClassName: org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner}
includeFileNamePattern: glob:**/*.json
inputDirURI: examples/batch/jsontype/rawdata
jobType: SegmentCreationAndTarPush
outputDirURI: examples/batch/jsontype/segments
overwriteOutput: true
pinotClusterSpecs:
- {controllerURI: '
```
@fx19880617:
@fx19880617: here are table/schema/ingestionSpec I’m using
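For readers following along without the attached files, a minimal Pinot schema and table config for this kind of JSON batch ingestion might look roughly like the sketch below. This is an illustrative assumption only, not the actual `myTable` files shared in this thread; the field names (`id`, `jsonField`) are hypothetical.
```
{
  "schemaName": "myTable",
  "dimensionFieldSpecs": [
    {"name": "id", "dataType": "STRING"},
    {"name": "jsonField", "dataType": "STRING"}
  ]
}
```
A matching offline table config could then enable a JSON index on the string column that holds the JSON payload:
```
{
  "tableName": "myTable",
  "tableType": "OFFLINE",
  "segmentsConfig": {
    "segmentPushType": "APPEND",
    "replication": "1"
  },
  "tenants": {},
  "tableIndexConfig": {
    "loadMode": "MMAP",
    "jsonIndexColumns": ["jsonField"]
  },
  "metadata": {}
}
```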
@dwarvenkat: @dwarvenkat has joined the channel
@christil: @christil has joined the channel
@rkruze: @rkruze has joined the channel
@kevdesigned: @kevdesigned has joined the channel
#pinot-dev
@mayanks: Sorry for spamming in various channels, but this is the last one :slightly_smiling_face: Any interest from Pinot-dev in presenting at ApacheCon:
@christil: @christil has joined the channel
#complex-type-support
@g.kishore: @amrish.k.lal did we summarize the notes from our Zoom call the other day?
@amrish.k.lal: Yes, the modified summary is a few comments above along with link to the doc.
@g.kishore: ok, we need to capture the discussion about storage of JSON and STRUCT.
@g.kishore: I have been thinking for some time about storing arbitrary JSON in an efficient way
@amrish.k.lal: Sure, we should discuss that, but I think for now we are mainly looking at the near-term items.
@g.kishore: yeah, will write it up when I get closer
@g.kishore: btw, I added some comments on the near-term items, though I haven't reviewed them in detail.
@amrish.k.lal: Also, there was a fair amount of discussion on complex types (STRUCT/MAP/LIST) as well, so here is a summary related to that discussion:
```
Adding support for JSON has commonalities with adding support for complex data types (STRUCT, LIST, MAP). The key difference is that JSON will not enforce schema validation (in keeping with the JSON standard), whereas STRUCT/LIST/MAP will support schema validation.

A table could be defined with both a JSON column and a STRUCT/LIST/MAP column. For example:

  nestedColumn1 JSON,
  nestedColumn2 STRUCT(name: STRING, age: INT, salary: INT,
                       addresses: LIST(STRUCT(apt: INT, street: STRING, city: STRING, zip: INT)))

The implementation steps that we describe under "Near Term Enhancements" are common to supporting both JSON and complex data types (STRUCT, LIST, MAP). Both JSON and STRUCT/LIST/MAP columns:
- would be stored as text,
- would use JsonIndex for fast filtering (with additional support for multidimensional arrays),
- would be queried via the new dot/array-based syntax proposed in "Language enhancements".

In the long term it is quite possible that these data types will share common hierarchical indexing functionality and storage mechanisms, while providing JSON-specific semantics with the JSON column type and more well-defined schema and type-checking semantics with the STRUCT/LIST/MAP types.
```
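To make the proposed dot/array access a bit more concrete, queries over the example columns above might look roughly like the following. This is a hypothetical sketch of the syntax direction, not the actual "Language enhancements" proposal; the table name (`someTable`) and the JSON field names (`orderId`, `status`) are assumed for illustration.
```
-- STRUCT column with a declared schema: nested fields via dots, list elements via array indexing
SELECT nestedColumn2.name, nestedColumn2.addresses[0].city
FROM someTable
WHERE nestedColumn2.age > 30

-- Schema-less JSON column: same access pattern, but field names and types are only known at query time
SELECT nestedColumn1.orderId
FROM someTable
WHERE nestedColumn1.status = 'SHIPPED'
```
Under the near-term plan summarized above, both column kinds would be backed by text storage plus the JsonIndex for filtering; the difference is whether the engine can type-check the path against a declared schema.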
@jackie.jxt: I would suggest focusing on JSON for now. I don't know if we want to store complex types as text in the long term. With a fixed schema, we can potentially store complex type values in a much more compressed way.
@jackie.jxt: We should be able to get some best practices from the JSON support, and then we can use them to design the complex type support later.
@amrish.k.lal: @g.kishore If you are moving forward with the long-term storage aspects of JSON, I would definitely like to discuss that further with you. cc: @steotia
@g.kishore: yes, will write it up