<h3><u>#general</u></h3><br><strong>@mhomaid: </strong>@afilipchik @g.kishore
Updated the Kafka Sink connector and added more configurations for batch and
streaming. I also added initial code for generating the Pinot segment. The
segment generator is boilerplate at the moment. I would appreciate your
guidance there. I didn’t want to go too far without reviewing it with you and
ensuring we are on the right path. The repo is at
<https://u17000708.ct.sendgrid.net/ls/click?upn=1BiFF0-2FtVRazUn1cLzaiMfbIlUXR74NmTijr4m8wCL51vLI5sH-2FJD7NmNJL2-2FNe7cHP3uilgYBKiJN-2F9L1x-2Bgw-3D-3DhGU2_vGLQYiKGfBLXsUt3KGBrxeq6BCTMpPOLROqAvDqBeTwWLyvlC1G9Lt9NOGPi2fUatzLJah4Q-2BMmypLeqjn1BHn1SGF4TGtZrhYb-2B7oOmE93sTg-2FK9waYaTFodzat4EhYtA15tZWhpmQre3B-2Fv1LiAGFkNj0YJPiOF4zWes7uSfX81bs57VCD3A0F3OrUqEJSyuA70B7HURkCEgX6OGotJxuWY1O6UY9O-2BJlOtSr16IA-3D><br><strong>@g.kishore:
</strong>Thanks @mhomaid, will take a look at it. Let’s create an issue and
start a discussion?<br><strong>@mhomaid: </strong>Great<br><strong>@ajit.kumar:
</strong>@ajit.kumar has joined the
channel<br><h3><u>#random</u></h3><br><strong>@ajit.kumar: </strong>@ajit.kumar
has joined the
channel<br><h3><u>#troubleshooting</u></h3><br><strong>@yash.agarwal:
</strong>I am getting
```java.lang.IllegalStateException: Unable to extract out the relative path
based on base input path:
<hdfs://bigredns/apps/hive/warehouse/dev_phx_chargers.db/guest_sdr_gst_data_sgl>
at
shaded.com.google.common.base.Preconditions.checkState(Preconditions.java:444)
at
org.apache.pinot.plugin.ingestion.batch.common.SegmentGenerationUtils.getRelativeOutputPath(SegmentGenerationUtils.java:144)
at
org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentGenerationJobRunner$1.call(SparkSegmentGenerationJobRunner.java:292)
at
org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentGenerationJobRunner$1.call(SparkSegmentGenerationJobRunner.java:214)```
the job config is
```inputDirURI:
'<hdfs://bigredns/apps/hive/warehouse/dev_phx_chargers.db/guest_sdr_gst_data_sgl>'
outputDirURI:
'<hdfs://bigredns/apps/hive/warehouse/dev_phx_chargers.db/guest_sdr_gst_data_sgl_segments>'```<br><strong>@fx19880617:
</strong>do you know the input path<br><strong>@fx19880617: </strong>we get
the input file like
`<hdfs://bigredns/apps/hive/warehouse/dev_phx_chargers.db/guest_sdr_gst_data_sgl/a/b/c.avro>`<br><strong>@fx19880617:
</strong>then output segment path should be
`<hdfs://bigredns/apps/hive/warehouse/dev_phx_chargers.db/guest_sdr_gst_data_sgl_segments/a/b/c.tar.gz>`<br><strong>@fx19880617:
</strong>we try to extract the relative path of
`a/b/c.avro`<br><strong>@fx19880617: </strong>so wanna check the input file
path<br><strong>@yash.agarwal: </strong>the input path is something like
```<hdfs://bigredns/apps/hive/warehouse/dev_phx_chargers.db/guest_sdr_gst_data_sgl/partition_d=2020-05-17/00000_0>```<br><strong>@yash.agarwal:
</strong>I’ll try making it more like the format you
mentioned.<br><strong>@fx19880617: </strong>hmm so file name is
`00000_0`?<br><strong>@fx19880617: </strong>here is some sample
layout:<br><strong>@fx19880617:
</strong><https://u17000708.ct.sendgrid.net/ls/click?upn=1BiFF0-2FtVRazUn1cLzaiMSfW2QiSG4bkQpnpkSL7FiK3MHb8libOHmhAW89nP5XK3RW6mQznmb3W3HzGtaNDxBoE2C01zOHxwoGE8C8WcXMKqmLEXWoT-2FaOlbIgsgPVQISCnrNolafFFNTC0-2FrKbsWW-2FuuIZC3Y-2FrK3js1bQ-2Bx9LQeS4MkyAZCXjp6-2BTpt6k2mqH_vGLQYiKGfBLXsUt3KGBrxeq6BCTMpPOLROqAvDqBeTwWLyvlC1G9Lt9NOGPi2fUa8iLj9XcK-2BbB3WKrBCovOKMKIceq6X3O7ZcUK7hay0l8Z6zsLzJsA88RrB6LAKArkdkW4tS8vK87kYy7LbMfKKq88KSM1Dyt7nYGB2qn9WO4D7-2FaLgSxrL1K6ix65N1YXGKybhVl03mWeWm3dPl5AFgwz0jWAkrydQTAyqRwW5kU-3D><br><strong>@yash.agarwal:
</strong>I am getting the same issue after converting it into this
format.<br><strong>@yash.agarwal:
</strong>```<hdfs://bigredns/user/Z00290G/guest_sdr_gst_data_sgl/2019/05/18/guest_sdr_gst_data_sgl_data_2019-05-18.avro>```<br><strong>@yash.agarwal:
</strong>```Unable to extract out the relative path based on base input path:
<hdfs://bigredns/user/Z00290G/guest_sdr_gst_data_sgl>```<br><strong>@yash.agarwal:
</strong>I was able to log the input file and output dir with
it.<br><strong>@yash.agarwal: </strong>```java.lang.IllegalStateException:
Unable to extract out the relative path based on base input path:
<hdfs://bigredns/user/Z00290G/guest_sdr_gst_data_sgl> inputFile:
hdfs:/user/Z00290G/guest_sdr_gst_data_sgl/2019/05/21/guest_sdr_gst_data_sgl_data_2019-05-21.avro
outputDir:
<hdfs://bigredns/user/Z00290G/guest_sdr_gst_data_sgl_staging>```<br><strong>@yash.agarwal:
</strong>it is removing the `<hdfs://bigredns>` from the input
file.<br><strong>@fx19880617: </strong>I wrote a simple test
```
@Test
public void testGetRelativePath() {
  System.out.println(SegmentGenerationUtils.getRelativeOutputPath(
      URI.create("hdfs://bigredns/user/Z00290G/guest_sdr_gst_data_sgl"),
      URI.create("hdfs://bigredns/user/Z00290G/guest_sdr_gst_data_sgl/2019/05/18/guest_sdr_gst_data_sgl_data_2019-05-18.avro"),
      URI.create("hdfs://bigredns/user/Z00290G/guest_sdr_gst_data_sgl_staging")));
}```
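<br>The `URI#relativize` behavior behind `getRelativeOutputPath` can be reproduced with the JDK alone. A minimal standalone sketch (the class name `RelativizeDemo` is made up for illustration; the paths are the ones from the logs in this thread, including the input-file URI that lost its `bigredns` authority):
```java
import java.net.URI;

public class RelativizeDemo {
  public static void main(String[] args) {
    URI base = URI.create("hdfs://bigredns/user/Z00290G/guest_sdr_gst_data_sgl");

    // Same scheme and authority as the base: relativize strips the base
    // prefix and returns the relative part of the path.
    URI good = URI.create(
        "hdfs://bigredns/user/Z00290G/guest_sdr_gst_data_sgl/2019/05/18/guest_sdr_gst_data_sgl_data_2019-05-18.avro");
    System.out.println(base.relativize(good));
    // -> 2019/05/18/guest_sdr_gst_data_sgl_data_2019-05-18.avro

    // Input file URI that lost its authority ("hdfs:/user/..." as in the
    // failing log): the authorities differ, so relativize returns the
    // input URI unchanged and no relative path can be extracted.
    URI bad = URI.create(
        "hdfs:/user/Z00290G/guest_sdr_gst_data_sgl/2019/05/21/guest_sdr_gst_data_sgl_data_2019-05-21.avro");
    System.out.println(base.relativize(bad));
    // -> hdfs:/user/Z00290G/guest_sdr_gst_data_sgl/2019/05/21/guest_sdr_gst_data_sgl_data_2019-05-21.avro
  }
}
```
When scheme and authority match, `relativize` yields the path relative to the base; when they differ, it hands back the input URI unchanged, which is presumably what trips the `Preconditions.checkState` in the stack trace above.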
<br><strong>@fx19880617: </strong>the relative output is
`<hdfs://bigredns/user/Z00290G/guest_sdr_gst_data_sgl_staging/2019/05/18/>`<br><strong>@fx19880617:
</strong>oh, I think the issue is that `inputFile` is not under `base input
path`<br><strong>@fx19880617: </strong>I’m using `URI relativePath =
baseInputDir.relativize(inputFile);` to get the relative
path<br><strong>@fx19880617: </strong>which will be empty in this
case<br><h3><u>#pinot-s3</u></h3><br><strong>@pradeepgv42:
</strong>@pradeepgv42 has joined the channel<br><strong>@pradeepgv42:
</strong>@pradeepgv42 has left the
channel<br><h3><u>#presto-pinot-streaming</u></h3><br><strong>@sosyalmedya.oguzhan:
</strong>@sosyalmedya.oguzhan has joined the channel<br><strong>@g.kishore:
</strong><!here> please meet @sosyalmedya.oguzhan<br><strong>@g.kishore:
</strong>he is interested in contributing to the streaming end point as
well<br><strong>@g.kishore: </strong>he is building a connector from
Spark<br><strong>@elon.azoulay: </strong>Hi
@sosyalmedya.oguzhan!<br><strong>@sosyalmedya.oguzhan: </strong>Hi to
everyone!<br><strong>@g.kishore: </strong>@sosyalmedya.oguzhan please share
your work on spark pinot connector<br><strong>@sosyalmedya.oguzhan:
</strong>I'm currently working as a data engineer, and I've seen that
technologies like Pinot are very important for real-time analytics. But in ETL
processes, Spark is generally used. For example, in my company we create
reports for our sellers, and these reports generally include OLAP queries. As
a result, we have to write data to Pinot, and we want to update or reindex data
in Pinot, etc. In this case, if we can write/read data to/from Pinot, we have
a lot of flexibility. We suffered a lot with this topic while using Druid.
Pinot is very powerful, and we want to contribute to Pinot for that.
The Spark-Pinot read path is written for now. We want to improve it by using
the streaming endpoint. The write path will also be added when the new segment
write API is released. If you want to check the code, please see the link below.
Note: I will create a pull request to Pinot for this connector.
<https://u17000708.ct.sendgrid.net/ls/click?upn=1BiFF0-2FtVRazUn1cLzaiMZdBcgXzX9kzFVfeA5GrHiMVp6FGDB9iBEJx0fu7YX-2BMf8-2BDxGfgNqPvHcp2-2FCwFKJ95tIgWDMU-2FUYW1XTCDWRbZXZ9KGpHbte89NwvbImIoccG0z9ihDr-2Bb7tahSCq21g-3D-3DqTSK_vGLQYiKGfBLXsUt3KGBrxeq6BCTMpPOLROqAvDqBeTwWLyvlC1G9Lt9NOGPi2fUaiL-2F9-2F7IGEsugo7ZwhZgK2lI2t-2BG5tPFtVaToLfqwvNhoezTD9WkeOXIdMkU4zv7x4gKpDKenKEs0l06pALEfHADAmPShs-2FKiorYdZ5Erp82qCy1kHTXU-2BS-2BsKeOqYu21yBKJkivetA0kYVpPQcQuDXqzRdoW-2BUzWnu-2F5yaCU6Bc-3D><br>