#general
@dwarvenkat: @dwarvenkat has joined the channel
@christil: @christil has joined the channel
@mayanks: Hello Pinot community, wondering if there might be interest in talks about your use cases with Pinot at ApacheCon:
@steotia: Hey Mayank, Uber and Li submitted a joint proposal
@mayanks: Thanks Sid, awesome. What's the talk title?
@steotia: Here is the
@mayanks: Nice, thanks!
@karinwolok1: Cool! Thanks!! Awesome @steotia!!!! :heart:
@karinwolok1: That sounds like an awesome presentation. I wonder if it would make sense to push it into a bunch of other conferences as well
@karinwolok1: and thank you for sharing @mayanks
@yupeng: hey, Pinot community, I want to share this Uber engineering blog (
@mayanks: Took the liberty of sharing this post on LinkedIn.
@chinmay.cerebro: Great job @ujwala.tulshigiri and @dharakkharod! It's an awesome blog.
@rkruze: @rkruze has joined the channel
@kevdesigned: @kevdesigned has joined the channel
#random
@dwarvenkat: @dwarvenkat has joined the channel
@christil: @christil has joined the channel
@rkruze: @rkruze has joined the channel
@kevdesigned: @kevdesigned has joined the channel
#troubleshooting
@syedakram93: Any update on this?
@fx19880617: what’s the issue?
@syedakram93:
@fx19880617: I tried with your setup and it works
@fx19880617: this is my ingestion config:
```
➜  cat examples/batch/jsontype/ingestionJobSpec.yaml
# executionFrameworkSpec: Defines ingestion jobs to be running.
executionFrameworkSpec:
  # name: execution framework name
  name: 'standalone'
  # segmentGenerationJobRunnerClassName: class name implements org.apache.pinot.spi.batch.ingestion.runner.SegmentGenerationJobRunner interface.
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  # segmentTarPushJobRunnerClassName: class name implements org.apache.pinot.spi.batch.ingestion.runner.SegmentTarPushJobRunner interface.
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
  # segmentUriPushJobRunnerClassName: class name implements org.apache.pinot.spi.batch.ingestion.runner.SegmentUriPushJobRunner interface.
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner'

# jobType: Pinot ingestion job type.
# Supported job types are:
#   'SegmentCreation'
#   'SegmentTarPush'
#   'SegmentUriPush'
#   'SegmentCreationAndTarPush'
#   'SegmentCreationAndUriPush'
jobType: SegmentCreationAndTarPush

# inputDirURI: Root directory of input data, expected to have scheme configured in PinotFS.
inputDirURI: 'examples/batch/jsontype/rawdata'

# includeFileNamePattern: include file name pattern, supported glob pattern.
# Sample usage:
#   'glob:*.avro' will include all avro files just under the inputDirURI, not sub directories;
#   'glob:**/*.avro' will include all the avro files under inputDirURI recursively.
includeFileNamePattern: 'glob:**/*.json'

# excludeFileNamePattern: exclude file name pattern, supported glob pattern.
# Sample usage:
#   'glob:*.avro' will exclude all avro files just under the inputDirURI, not sub directories;
#   'glob:**/*.avro' will exclude all the avro files under inputDirURI recursively.
# _excludeFileNamePattern: ''

# outputDirURI: Root directory of output segments, expected to have scheme configured in PinotFS.
outputDirURI: 'examples/batch/jsontype/segments'

# overwriteOutput: Overwrite output segments if existed.
overwriteOutput: true

# pinotFSSpecs: defines all related Pinot file systems.
pinotFSSpecs:
  -
    # scheme: used to identify a PinotFS.
    # E.g. local, hdfs, dbfs, etc
    scheme: file

    # className: Class name used to create the PinotFS instance.
    # E.g.
    #   org.apache.pinot.spi.filesystem.LocalPinotFS is used for local filesystem
    #   org.apache.pinot.plugin.filesystem.AzurePinotFS is used for Azure Data Lake
    #   org.apache.pinot.plugin.filesystem.HadoopPinotFS is used for HDFS
    className: org.apache.pinot.spi.filesystem.LocalPinotFS

# recordReaderSpec: defines all record reader
recordReaderSpec:
  # dataFormat: Record data format, e.g. 'avro', 'parquet', 'orc', 'csv', 'json', 'thrift' etc.
  dataFormat: 'json'

  # className: Corresponding RecordReader class name.
  # E.g.
  #   org.apache.pinot.plugin.inputformat.avro.AvroRecordReader
  #   org.apache.pinot.plugin.inputformat.csv.CSVRecordReader
  #   org.apache.pinot.plugin.inputformat.parquet.ParquetRecordReader
  #   org.apache.pinot.plugin.inputformat.json.JSONRecordReader
  #   org.apache.pinot.plugin.inputformat.orc.ORCRecordReader
  #   org.apache.pinot.plugin.inputformat.thrift.ThriftRecordReader
  className: 'org.apache.pinot.plugin.inputformat.json.JSONRecordReader'

  # configClassName: Corresponding RecordReaderConfig class name, it's mandatory for CSV and Thrift file format.
  # E.g.
  #   org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig
  #   org.apache.pinot.plugin.inputformat.thrift.ThriftRecordReaderConfig
  configClassName:

  # configs: Used to init RecordReaderConfig class name, this config is required for CSV and Thrift data format.
  configs:

# tableSpec: defines table name and where to fetch corresponding table config and table schema.
tableSpec:
  # tableName: Table name
  tableName: 'myTable'

  # schemaURI: defines where to read the table schema, supports PinotFS or HTTP.
  # E.g.
  #
```
@fx19880617: This is the job log:
```
➜  bin/pinot-admin.sh LaunchDataIngestionJob -jobSpecFile examples/batch/jsontype/ingestionJobSpec.yaml
SegmentGenerationJobSpec: !!org.apache.pinot.spi.ingestion.batch.spec.SegmentGenerationJobSpec
cleanUpOutputDir: false
excludeFileNamePattern: null
executionFrameworkSpec: {extraConfigs: null, name: standalone,
  segmentGenerationJobRunnerClassName: org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner,
  segmentMetadataPushJobRunnerClassName: null,
  segmentTarPushJobRunnerClassName: org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner,
  segmentUriPushJobRunnerClassName: org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner}
includeFileNamePattern: glob:**/*.json
inputDirURI: examples/batch/jsontype/rawdata
jobType: SegmentCreationAndTarPush
outputDirURI: examples/batch/jsontype/segments
overwriteOutput: true
pinotClusterSpecs:
- {controllerURI: '
```
@fx19880617:
@fx19880617: here are table/schema/ingestionSpec I’m using
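For readers following along without the attached files, a minimal Pinot schema and table config for this kind of JSON batch ingestion might look roughly like the sketch below. This is an illustrative assumption only, not the actual `myTable` files shared in this thread; the field names (`id`, `jsonField`) are hypothetical.
```
{
  "schemaName": "myTable",
  "dimensionFieldSpecs": [
    {"name": "id", "dataType": "STRING"},
    {"name": "jsonField", "dataType": "STRING"}
  ]
}
```
A matching offline table config could then enable a JSON index on the string column that holds the JSON payload:
```
{
  "tableName": "myTable",
  "tableType": "OFFLINE",
  "segmentsConfig": {
    "segmentPushType": "APPEND",
    "replication": "1"
  },
  "tenants": {},
  "tableIndexConfig": {
    "loadMode": "MMAP",
    "jsonIndexColumns": ["jsonField"]
  },
  "metadata": {}
}
```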
@dwarvenkat: @dwarvenkat has joined the channel
@christil: @christil has joined the channel
@rkruze: @rkruze has joined the channel
@kevdesigned: @kevdesigned has joined the channel
#pinot-dev
@mayanks: Sorry for spamming in various channels, but this is the last one :slightly_smiling_face: Any interest from Pinot-dev in presenting at ApacheCon:
@christil: @christil has joined the channel
#complex-type-support
@g.kishore: @amrish.k.lal did we summarize the notes from our Zoom call the other day?
@amrish.k.lal: Yes, the modified summary is a few comments above along with link to the doc.
@g.kishore: ok, we need to capture the discussion about storage of JSON and STRUCT.
@g.kishore: I have been thinking for some time about storing arbitrary JSON in an efficient way
@amrish.k.lal: Sure, we should discuss that, but I think for now we are mainly looking at the near-term items.
@g.kishore: yeah, will write it up when I get closer
@g.kishore: btw, I added some comments on the near-term items, though I haven't reviewed them in detail.
@amrish.k.lal: Also, there was a fair amount of discussion on complex types (STRUCT/MAP/LIST) as well, so here is a summary related to that discussion:
```
Adding support for JSON has commonalities with adding support for complex data types (STRUCT, LIST, MAP). The key difference is that JSON will not enforce schema validation (in keeping with the JSON standard), whereas STRUCT/LIST/MAP will support schema validation.

A table could be defined with both a JSON column and a STRUCT/LIST/MAP column. For example:

  nestedColumn1 JSON,
  nestedColumn2 STRUCT(name: STRING, age: INT, salary: INT,
                       addresses: LIST(STRUCT(apt: INT, street: STRING, city: STRING, zip: INT)))

The implementation steps that we describe under "Near Term Enhancements" are common to supporting both JSON and complex data types (STRUCT, LIST, MAP). Both JSON and STRUCT/LIST/MAP columns:
- would be stored as text,
- would use JsonIndex for fast filtering (with additional support for multidimensional arrays),
- would be queried via the new dot/array-based syntax proposed in "Language enhancements".

In the long term it is quite possible that these data types will share common hierarchical indexing functionality and storage mechanisms, while providing JSON-specific semantics with the JSON column type and more well-defined schema and type-checking semantics with the STRUCT/LIST/MAP types.
```
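To make the proposed dot/array access a bit more concrete, queries over the example columns above might look roughly like the following. This is a hypothetical sketch of the syntax direction, not the actual "Language enhancements" proposal; the table name (`someTable`) and the JSON field names (`orderId`, `status`) are assumed for illustration.
```
-- STRUCT column with a declared schema: nested fields via dots, list elements via array indexing
SELECT nestedColumn2.name, nestedColumn2.addresses[0].city
FROM someTable
WHERE nestedColumn2.age > 30

-- Schema-less JSON column: same access pattern, but field names and types are only known at query time
SELECT nestedColumn1.orderId
FROM someTable
WHERE nestedColumn1.status = 'SHIPPED'
```
Under the near-term plan summarized above, both column kinds would be backed by text storage plus the JsonIndex for filtering; the difference is whether the engine can type-check the path against a declared schema.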
@jackie.jxt: I would suggest focusing on JSON for now. I don't know if we want to store complex types as text in the long term. With a fixed schema, we can potentially store complex type values in a much more compressed way.
@jackie.jxt: We should be able to get some best practices from the JSON support, and then we can use them to design the complex type support later.
@amrish.k.lal: @g.kishore If you are moving forward with the long-term storage aspects of JSON, I would definitely like to discuss that further with you. cc: @steotia
@g.kishore: yes, will write it up