<h3><u>#general</u></h3><br><strong>@katneniravikiran: </strong>Pinot is taking a long time to import data when the data size is huge. I am using the "standalone" data load job. Trying with 80GB TPCH Lineitem data split into 600 files (each file is around 130MB). Creating the segment files is taking around 3 hours on a 4 CPU 64GB RAM machine. Is this expected behavior?<br><strong>@mayanks: 
</strong>How many controllers do you have? Are you pushing files 
sequentially?<br><strong>@mayanks: </strong>Using deep-store with segment-uri push will help reduce the time, by avoiding having to push the actual payload<br><strong>@katneniravikiran: </strong>Two 
controllers<br><strong>@katneniravikiran: </strong>Can you help me find the documentation for "deep-store with segment-uri push"?<br><strong>@mayanks: 
</strong>Here's a sample: 
<https://u17000708.ct.sendgrid.net/ls/click?upn=1BiFF0-2FtVRazUn1cLzaiMdTeAXadp8BL3QinSdRtJdo5Vml3IIBLs7JiV3aA7xH6E1xuyunoVEcnqa9dbWn9CIJQSKp-2BZWo-2BWye-2FipnOZiQLReT6c2Mzg50KdMUPKcFdVOMaIzyscgjS7oqxfmp915p3jVDV-2FG594Eqj0lJbWEW6V77mINHj3FYdWSgpfbXYrwXW_vGLQYiKGfBLXsUt3KGBrxeq6BCTMpPOLROqAvDqBeTylyTc9khVGqOPJLveBssI2aIj6ojn6scFUopLefrXk7X8JWqO8jPV0rvh2IlKpNnSm-2FbwDtyAuFXi-2Bx2i6pHbTGZf4sbtR6kGGxtojcVwKFf9eNnMRQR7BvR5YsCJpjSpGIbMrppIPX8rGoPW3Yt6p0PMEhU5GepOuvZDGv5o2kDcxPVocWlViieqE6Y-2FqGWw-3D><br><strong>@katneniravikiran:
</strong>Is there an option other than using HDFS?<br><strong>@mayanks: </strong>Yeah, you can also use GCS or S3<br><strong>@katneniravikiran: </strong>Ok, thanks<br><strong>@mayanks: </strong>May I ask what you are trying to achieve? Is this a benchmark?<br><strong>@mayanks: </strong>Actually, looks 
like even when using deep-store, the controller may still need to download the 
segments (metadata push may not be supported yet)<br><strong>@katneniravikiran: 
</strong>Yes, we are trying to benchmark the Presto-Pinot combination for 10GB to 200GB TPCH data. We want good query performance with fast loading capability. When compared with other OLAP DBs, Pinot seems to be taking a long time to load data. One observation is that the standalone job is using a single CPU (not sure how many threads) for a single upload job, even when there are multiple files in the import folder. Other OLAP DBs seem to load data using more than one CPU. Is there any setting to make the Pinot import job use more than one CPU? Using HDFS, S3 or GCS is not in the scope of benchmarking. We want to minimize the dependency on Hadoop or other big systems because the data sizes we are targeting are not truly big data territory.<br><strong>@mayanks: 
</strong>Ok, then we need to understand where the time is spent. Is it on index 
generation or actual push? <br><strong>@katneniravikiran: </strong>Indexing is 
taking time. Push is relatively faster.<br><strong>@mayanks: </strong>Hmm, then it should be easy to make the job multi-threaded, if it isn’t already<br><strong>@g.kishore: </strong>There is a parallelism setting
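<br>For context, the standalone job is driven by an ingestion job spec file, and the parallelism knob lives there. Below is a minimal sketch, assuming a Pinot version whose standalone job spec supports the `segmentCreationJobParallelism` field; the paths, table name and input format are placeholders:
```
# Sketch of a standalone batch ingestion job spec (paths/table/format are placeholders)
executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner'
# SegmentCreationAndUriPush (with a deep store) avoids shipping the segment payload to the controller
jobType: SegmentCreationAndTarPush
inputDirURI: '/data/tpch/lineitem/'
includeFileNamePattern: 'glob:**/*.csv'
outputDirURI: '/data/segments/lineitem/'
overwriteOutput: true
# How many segment-creation tasks run concurrently (roughly one input file per task)
segmentCreationJobParallelism: 4
recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
tableSpec:
  tableName: 'lineitem'
pinotClusterSpecs:
  - controllerURI: 'http://localhost:9000'
```
With that set, the 600 input files should be spread across several concurrent segment-creation tasks instead of a single CPU; if the field isn't available in your build, splitting the input folder across a few separate job specs is a crude workaround.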
<br><h3><u>#feat-presto-connector</u></h3><br><strong>@christian: 
</strong>@christian has joined the 
channel<br><h3><u>#troubleshooting</u></h3><br><strong>@yash.agarwal: 
</strong>I am using presto for querying and joining results from Pinot. What is the recommended approach to do multiple aggregations like the following in a single query?
```select channel,
    sales_date,
    sum(sales) as sum_sales,
    sum(units) as sum_units
from pinot.default.sales
group by channel, sales_date```
Currently presto is trying to fetch raw values for all the 
columns.<br><strong>@yash.agarwal: </strong>Also how can I use custom Pinot UDFs like segmentPartitionedDistinctCount in presto queries?<br><strong>@g.kishore: </strong>@yash.agarwal yes, @fx19880617 let's enable allow-multiple-aggregations by default in the presto pinot connector<br><strong>@fx19880617: </strong>will do
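<br>Two things may help with the questions above, both depending on the connector version: the `pinot.allow-multiple-aggregations` catalog property just mentioned, and pass-through ("dynamic table") queries, where an entire Pinot query, including Pinot-only functions such as segmentPartitionedDistinctCount, is sent to the broker as a quoted table name. A rough sketch, with column names as placeholders:
```
-- With pinot.allow-multiple-aggregations=true in the pinot catalog properties,
-- both sum() aggregations in the query above can be pushed down to Pinot together.

-- Pass-through form (if the connector build supports it): the quoted string is
-- executed by the Pinot broker directly, so Pinot-only functions are usable.
select *
from pinot.default."select channel, segmentPartitionedDistinctCount(order_id) from sales group by channel"
```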
<br><strong>@mailtobuchi: </strong>Hey everyone, `DISTINCTCOUNT` queries on raw data from realtime tables seem to be very slow. Tried the HLL approximation but that didn’t help. If we were to be okay with approximated results, would you recommend `Theta Sketches`? Is that generally faster than HLL?<br><strong>@mayanks: 
</strong>HLL is faster than T/S<br><strong>@mayanks: </strong>T/S is better if 
you want to do set operations like 
intersect/union/difference<br><strong>@mayanks: </strong>HLL or T/S helps if you pre-aggregate, which you can't do for RT<br><strong>@mailtobuchi: </strong>Hmm.. 
For most of our queries, both `DISTINCTCOUNT` and `HLL` are equally slow. Are 
there any optimizations that we can do to improve the 
latencies?<br><strong>@mayanks: </strong>what's the 
numDocsScanned?<br><strong>@mayanks: </strong>A good feature ask would be `Aggregating HLL T/S derived columns during consumption` (cc: @jackie.jxt)<br><strong>@jackie.jxt: </strong>We should support all the 
aggregations available in `ValueAggregator` for aggregation during 
consumption<br><strong>@jackie.jxt: </strong>FYI, those are the aggregations supported by star-tree
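<br>For the latency question above, one way to compare the options on the realtime table is to run the variants side by side and check `numDocsScanned` and the timing stats in the query response metadata. A sketch, assuming a Pinot version where `DISTINCTCOUNTTHETASKETCH` is available; table and column names are placeholders:
```
-- Exact distinct count; typically the slowest on high-cardinality raw data
select DISTINCTCOUNT(user_id) from events_realtime
-- HyperLogLog approximation
select DISTINCTCOUNTHLL(user_id) from events_realtime
-- Theta-sketch approximation; mainly useful when set operations are also needed
select DISTINCTCOUNTTHETASKETCH(user_id) from events_realtime
```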
<br><h3><u>#pinot-0-5-0-release</u></h3><br><strong>@tingchen: </strong>@fx19880617 can you take a look at 
<https://u17000708.ct.sendgrid.net/ls/click?upn=1BiFF0-2FtVRazUn1cLzaiMSfW2QiSG4bkQpnpkSL7FiK3MHb8libOHmhAW89nP5XK4rP-2BkFe5YEFfdRMoaAM6kg-3D-3DosIx_vGLQYiKGfBLXsUt3KGBrxeq6BCTMpPOLROqAvDqBeTylyTc9khVGqOPJLveBssI2gTFYpV9Y9Za9IhPmwUCSGxYYjYF1ZdZvjwwWiSIkDyBRsYDIrn6BgbGi-2B8PYNgOppkWVDt7qHyE6Yo-2FEu3ElEaI7OROfz6-2FeWqfn6ng5BQSnjmXeJODYM1jSqye-2F7ghIYTcJCR93RJKyA9C4gRhaMHFbJNluLdWoxXaA-2BoWg6a0-3D>?
 for the license and notification file changes?<br><strong>@tingchen: 
</strong>thanks.<br><strong>@fx19880617: 
</strong>Sure<br><h3><u>#lp-pinot-poc</u></h3><br><strong>@andrew: 
</strong><!here> thanks for everyone’s help so far. I invited you all to a 
GitHub project where I’ve put the cluster setup. Let me know if you are able to 
provide further assistance and if you have any 
questions.<br><strong>@g.kishore: </strong>thanks Andrew, we will try it this 
week and get back to you<br>
