<h3><u>#general</u></h3><br><strong>@mhomaid: </strong>@afilipchik @g.kishore
Updated the Kafka Sink connector and added more configurations for batch and
streaming. I also added initial code for generating the Pinot segment. The
segment generator is boilerplate at the moment. I would appreciate your
guidance there. I didn’t want to go too far without reviewing it with you and
ensuring we are on the right path. The repo is at
<https://u17000708.ct.sendgrid.net/ls/click?upn=1BiFF0-2FtVRazUn1cLzaiMfbIlUXR74NmTijr4m8wCL51vLI5sH-2FJD7NmNJL2-2FNe7cHP3uilgYBKiJN-2F9L1x-2Bgw-3D-3DhGU2_vGLQYiKGfBLXsUt3KGBrxeq6BCTMpPOLROqAvDqBeTwWLyvlC1G9Lt9NOGPi2fUatzLJah4Q-2BMmypLeqjn1BHn1SGF4TGtZrhYb-2B7oOmE93sTg-2FK9waYaTFodzat4EhYtA15tZWhpmQre3B-2Fv1LiAGFkNj0YJPiOF4zWes7uSfX81bs57VCD3A0F3OrUqEJSyuA70B7HURkCEgX6OGotJxuWY1O6UY9O-2BJlOtSr16IA-3D><br><strong>@g.kishore:
</strong>Thanks @mhomaid, will take a look at it. Let’s create an issue and
start a discussion?<br><strong>@mhomaid: </strong>Great<br><strong>@ajit.kumar:
</strong>@ajit.kumar has joined the
channel<br><h3><u>#random</u></h3><br><strong>@ajit.kumar: </strong>@ajit.kumar
has joined the
channel<br><h3><u>#troubleshooting</u></h3><br><strong>@yash.agarwal:
</strong>I am getting
```java.lang.IllegalStateException: Unable to extract out the relative path
based on base input path:
<hdfs://bigredns/apps/hive/warehouse/dev_phx_chargers.db/guest_sdr_gst_data_sgl>
at
shaded.com.google.common.base.Preconditions.checkState(Preconditions.java:444)
at
org.apache.pinot.plugin.ingestion.batch.common.SegmentGenerationUtils.getRelativeOutputPath(SegmentGenerationUtils.java:144)
at
org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentGenerationJobRunner$1.call(SparkSegmentGenerationJobRunner.java:292)
at
org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentGenerationJobRunner$1.call(SparkSegmentGenerationJobRunner.java:214)```
the job config is
```inputDirURI:
'<hdfs://bigredns/apps/hive/warehouse/dev_phx_chargers.db/guest_sdr_gst_data_sgl>'
outputDirURI:
'<hdfs://bigredns/apps/hive/warehouse/dev_phx_chargers.db/guest_sdr_gst_data_sgl_segments>'```<br><strong>@fx19880617:
</strong>do you know the input path<br><strong>@fx19880617: </strong>we get
the input file like
`<hdfs://bigredns/apps/hive/warehouse/dev_phx_chargers.db/guest_sdr_gst_data_sgl/a/b/c.avro>`<br><strong>@fx19880617:
</strong>then output segment path should be
`<hdfs://bigredns/apps/hive/warehouse/dev_phx_chargers.db/guest_sdr_gst_data_sgl_segments/a/b/c.tar.gz>`<br><strong>@fx19880617:
</strong>we try to extract the relative path of
`a/b/c.avro`<br><strong>@fx19880617: </strong>so wanna check the input file
path<br><strong>@yash.agarwal: </strong>the input path is something like
```<hdfs://bigredns/apps/hive/warehouse/dev_phx_chargers.db/guest_sdr_gst_data_sgl/partition_d=2020-05-17/00000_0>```<br><strong>@yash.agarwal:
</strong>I’ll try making it more like the format you
mentioned.<br><strong>@fx19880617: </strong>hmm so file name is
`00000_0`?<br><strong>@fx19880617: </strong>here is some sample
layout:<br><strong>@fx19880617:
</strong><https://u17000708.ct.sendgrid.net/ls/click?upn=1BiFF0-2FtVRazUn1cLzaiMSfW2QiSG4bkQpnpkSL7FiK3MHb8libOHmhAW89nP5XK3RW6mQznmb3W3HzGtaNDxBoE2C01zOHxwoGE8C8WcXMKqmLEXWoT-2FaOlbIgsgPVQISCnrNolafFFNTC0-2FrKbsWW-2FuuIZC3Y-2FrK3js1bQ-2Bx9LQeS4MkyAZCXjp6-2BTpt6k2mqH_vGLQYiKGfBLXsUt3KGBrxeq6BCTMpPOLROqAvDqBeTwWLyvlC1G9Lt9NOGPi2fUa8iLj9XcK-2BbB3WKrBCovOKMKIceq6X3O7ZcUK7hay0l8Z6zsLzJsA88RrB6LAKArkdkW4tS8vK87kYy7LbMfKKq88KSM1Dyt7nYGB2qn9WO4D7-2FaLgSxrL1K6ix65N1YXGKybhVl03mWeWm3dPl5AFgwz0jWAkrydQTAyqRwW5kU-3D><br><strong>@yash.agarwal:
</strong>I am getting the same issue after converting it into this
format.<br><strong>@yash.agarwal:
</strong>```<hdfs://bigredns/user/Z00290G/guest_sdr_gst_data_sgl/2019/05/18/guest_sdr_gst_data_sgl_data_2019-05-18.avro>```<br><strong>@yash.agarwal:
</strong>```Unable to extract out the relative path based on base input path:
<hdfs://bigredns/user/Z00290G/guest_sdr_gst_data_sgl>```<br><strong>@yash.agarwal:
</strong>I was able to log the input file and output dir with
it.<br><strong>@yash.agarwal: </strong>```java.lang.IllegalStateException:
Unable to extract out the relative path based on base input path:
<hdfs://bigredns/user/Z00290G/guest_sdr_gst_data_sgl> inputFile:
hdfs:/user/Z00290G/guest_sdr_gst_data_sgl/2019/05/21/guest_sdr_gst_data_sgl_data_2019-05-21.avro
outputDir:
<hdfs://bigredns/user/Z00290G/guest_sdr_gst_data_sgl_staging>```<br><strong>@yash.agarwal:
</strong>it is removing the `<hdfs://bigredns>` from the input
file.<br><strong>@fx19880617: </strong>I wrote a simple test
```
@Test
public void testGetRelativePath() {
  System.out.println(SegmentGenerationUtils.getRelativeOutputPath(
      URI.create("hdfs://bigredns/user/Z00290G/guest_sdr_gst_data_sgl"),
      URI.create("hdfs://bigredns/user/Z00290G/guest_sdr_gst_data_sgl/2019/05/18/guest_sdr_gst_data_sgl_data_2019-05-18.avro"),
      URI.create("hdfs://bigredns/user/Z00290G/guest_sdr_gst_data_sgl_staging")));
}```
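<br>The `URI#relativize` behavior behind `getRelativeOutputPath` can be reproduced with the JDK alone. A minimal standalone sketch (the class name `RelativizeDemo` is made up for illustration; the paths are the ones from the logs in this thread, including the input-file URI that lost its `bigredns` authority):
```java
import java.net.URI;

public class RelativizeDemo {
  public static void main(String[] args) {
    URI base = URI.create("hdfs://bigredns/user/Z00290G/guest_sdr_gst_data_sgl");

    // Same scheme and authority as the base: relativize strips the base
    // prefix and returns the relative part of the path.
    URI good = URI.create(
        "hdfs://bigredns/user/Z00290G/guest_sdr_gst_data_sgl/2019/05/18/guest_sdr_gst_data_sgl_data_2019-05-18.avro");
    System.out.println(base.relativize(good));
    // -> 2019/05/18/guest_sdr_gst_data_sgl_data_2019-05-18.avro

    // Input file URI that lost its authority ("hdfs:/user/..." as in the
    // failing log): the authorities differ, so relativize returns the
    // input URI unchanged and no relative path can be extracted.
    URI bad = URI.create(
        "hdfs:/user/Z00290G/guest_sdr_gst_data_sgl/2019/05/21/guest_sdr_gst_data_sgl_data_2019-05-21.avro");
    System.out.println(base.relativize(bad));
    // -> hdfs:/user/Z00290G/guest_sdr_gst_data_sgl/2019/05/21/guest_sdr_gst_data_sgl_data_2019-05-21.avro
  }
}
```
When scheme and authority match, `relativize` yields the path relative to the base; when they differ, it hands back the input URI unchanged, which is presumably what trips the `Preconditions.checkState` in the stack trace above.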
<br><strong>@fx19880617: </strong>the relative output is
`<hdfs://bigredns/user/Z00290G/guest_sdr_gst_data_sgl_staging/2019/05/18/>`<br><strong>@fx19880617:
</strong>oh, I think the issue is that `inputFile` is not under `base input
path`<br><strong>@fx19880617: </strong>I’m using `URI relativePath =
baseInputDir.relativize(inputFile);` to get the relative
path<br><strong>@fx19880617: </strong>which will be empty in this
case<br><h3><u>#pinot-s3</u></h3><br><strong>@pradeepgv42:
</strong>@pradeepgv42 has joined the channel<br><strong>@pradeepgv42:
</strong>@pradeepgv42 has left the
channel<br><h3><u>#presto-pinot-streaming</u></h3><br><strong>@sosyalmedya.oguzhan:
</strong>@sosyalmedya.oguzhan has joined the channel<br><strong>@g.kishore:
</strong><!here> please meet @sosyalmedya.oguzhan<br><strong>@g.kishore:
</strong>he is interested in contributing to the streaming end point as
well<br><strong>@g.kishore: </strong>he is building a connector from
Spark<br><strong>@elon.azoulay: </strong>Hi
@sosyalmedya.oguzhan!<br><strong>@sosyalmedya.oguzhan: </strong>Hi to
everyone!<br><strong>@g.kishore: </strong>@sosyalmedya.oguzhan please share
your work on spark pinot connector<br><strong>@sosyalmedya.oguzhan:
</strong>I'm currently working as a data engineer, and I've seen that
technologies like Pinot are very important for real-time analytics. But in ETL
processes, Spark is generally used. For example, in my company we create
reports for our sellers, and these reports generally include OLAP queries. As
a result, we have to write data to Pinot, and we want to update or reindex data
in Pinot, etc. In this case, if we can write/read data to/from Pinot, we have
a lot of flexibility. We suffered a lot with this topic while using Druid.
Pinot is very powerful, and we want to contribute to Pinot for that.
The Spark-Pinot read path is written for now. We want to improve it by using
the streaming endpoint. The write path will also be added when the new segment
write API is released. If you want to check the code, please see the link below.
Note: I will create a pull request to Pinot for this connector.
<https://u17000708.ct.sendgrid.net/ls/click?upn=1BiFF0-2FtVRazUn1cLzaiMZdBcgXzX9kzFVfeA5GrHiMVp6FGDB9iBEJx0fu7YX-2BMf8-2BDxGfgNqPvHcp2-2FCwFKJ95tIgWDMU-2FUYW1XTCDWRbZXZ9KGpHbte89NwvbImIoccG0z9ihDr-2Bb7tahSCq21g-3D-3DqTSK_vGLQYiKGfBLXsUt3KGBrxeq6BCTMpPOLROqAvDqBeTwWLyvlC1G9Lt9NOGPi2fUaiL-2F9-2F7IGEsugo7ZwhZgK2lI2t-2BG5tPFtVaToLfqwvNhoezTD9WkeOXIdMkU4zv7x4gKpDKenKEs0l06pALEfHADAmPShs-2FKiorYdZ5Erp82qCy1kHTXU-2BS-2BsKeOqYu21yBKJkivetA0kYVpPQcQuDXqzRdoW-2BUzWnu-2F5yaCU6Bc-3D><br>