<h3><u>#general</u></h3><br><strong>@ejankowski_pinot: 
</strong>@ejankowski_pinot has joined the channel<br><strong>@damianoporta: 
</strong>Hello, I am going to pay for two 
<https://u17000708.ct.sendgrid.net/ls/click?upn=iSrCRfgZvz-2BV64a3Rv7HYZM7UL-2FpxzX1gS-2B3SVCw6jyWkV-2BUJdZFGULHJiOo7f-2BX9m-3_vGLQYiKGfBLXsUt3KGBrxeq6BCTMpPOLROqAvDqBeTxCEj3oL4zhLSDAS-2Bozt9-2BVoDDDzJLvxZJWblVN1BZqbHahtnt1GjlwoFpFUnklDSkwfVZIGGkRMP2LUIbhzXySiyQwoM2vSF-2FrLTDKVLJrLTpztu-2BFePMB07X9-2FhIY-2F7y94NkYVFpuHDu1cc0gW-2FaDrawhr-2FUh4a8x5C6ZS7yppeTV9yX5VoG96VN0DD-2FJVVU-3D>
 cloud servers to set up the cluster as explained in the doc (thanks 
@npawar). I have a doubt: they offer vCPUs, like most cloud services. Would 
that work well? Has anyone ever used their service?<br><strong>@aruncthomas: 
</strong>@aruncthomas has joined the channel<br><strong>@pyne.suvodeep: 
</strong>@pyne.suvodeep has joined the 
channel<br><h3><u>#random</u></h3><br><strong>@ejankowski_pinot: 
</strong>@ejankowski_pinot has joined the channel<br><strong>@aruncthomas: 
</strong>@aruncthomas has joined the channel<br><strong>@pyne.suvodeep: 
</strong>@pyne.suvodeep has joined the 
channel<br><h3><u>#troubleshooting</u></h3><br><strong>@somanshu.jindal: 
</strong>@somanshu.jindal has joined the channel<br><strong>@quietgolfer: 
</strong>I'm having issues with slow queries.  I recently started moving away 
from the built-in time columns to my own floored `utc_date`.  Now my queries 
are taking 5 seconds over 80 million rows (a lot slower than before).

 I removed some sensitive parts.
```metrics_offline_table_config.json: |-
    {
      "tableName": "metrics",
      "tableType":"OFFLINE",
      "segmentsConfig" : {
        "schemaName" : "metrics",
        "timeColumnName": "timestamp",
        "timeType": "MILLISECONDS",
        "retentionTimeUnit": "DAYS",
        "retentionTimeValue": "1461",
        "segmentPushType": "APPEND",
        "segmentAssignmentStrategy": "BalanceNumSegmentAssignmentStrategy",
        "replication" : "1"
      },
      "tableIndexConfig" : {
        "loadMode"  : "MMAP",
        "noDictionaryColumns": ["impressions"],
        "starTreeIndexConfigs": [
          {
            "dimensionsSplitOrder": [
              "utc_date",
              "platform_id",
              "account_id",
              "campaign_id"
            ],
            "skipStarNodeCreationForDimensions": [
            ],
            "functionColumnPairs": [
              "SUM__impressions",
            ]
          }
        ]
      },
      "tenants" : {},
      "metadata": {
        "customConfigs": {}
      }
    }```
The query I'm running looks pretty basic.  It asks for aggregate stats at a 
high level.  In my data, there are 8 unique utc_dates and 1 unique platform.
```select utc_date, sum(impressions) from metrics where platform_id = 13 group 
by utc_date```
Recent changes:
• switched from timestamp to my own utc_date (long).
• added `"noDictionaryColumns": ["impressions"],`
Before these changes, this query took 50-100 ms.

I'm going to bed now.  No need to rush an answer.
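(For reference, the per-query execution stats that Pinot returns with every 
response are the first thing to check here; a minimal sketch, assuming a 
broker at localhost:8099 and the standard SQL query endpoint:)
```
# Sketch, assuming a broker at localhost:8099; adjust host/port to your setup.
# The JSON response carries the execution stats (timeUsedMs, numDocsScanned,
# numEntriesScannedInFilter, ...) alongside the result rows.
curl -s -X POST -H 'Content-Type: application/json' \
  -d '{"sql": "select utc_date, sum(impressions) from metrics where platform_id = 13 group by utc_date"}' \
  localhost:8099/query/sql
```
<br><strong>@damianoporta: 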
</strong>Hello everybody! I need support to set up a cluster. I followed the 
instructions explained by @npawar in her video, and everything works as 
expected, but that test was local.
Now I need to organize all the components across a real cluster of 2-3 
servers, and I want to understand how to arrange them for a 
high-availability architecture.
Obviously, I am talking about a very small cluster, so take "high-availability" 
with a grain of salt :)
My doubt is about how to distribute the components over the servers.
For example, in the video there is just one Zookeeper instance, started 
with `pinot-admin.sh StartZookeeper -zkPort 2181`. So the first question is: 
what happens if the server running Zookeeper goes down? Can we spread two or 
more Zookeeper instances over multiple servers? And supposing we can run 
multiple Zookeeper instances, should every machine also have its own 
Controller, Broker, and Server components? Having more than one 
broker/controller on the same machine doesn't make much sense to me, except 
maybe for very high traffic. Could someone explain this a little more? Thanks.
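(For reference, ZooKeeper is normally made fault-tolerant by running a 
three-node ensemble, one instance per machine, rather than a single instance; 
every Pinot component then points at the whole ensemble. A minimal sketch, 
assuming three hypothetical hosts zk1, zk2, zk3:)
```
# Sketch, assuming hypothetical hosts zk1, zk2, zk3, each running a ZooKeeper
# instance that zoo.cfg joins into one ensemble:
#   server.1=zk1:2888:3888
#   server.2=zk2:2888:3888
#   server.3=zk3:2888:3888
# Each Pinot component connects to the full ensemble, so the cluster survives
# the loss of any single ZooKeeper node:
ZK="zk1:2181,zk2:2181,zk3:2181"
pinot-admin.sh StartController -zkAddress "$ZK" -controllerPort 9000
pinot-admin.sh StartBroker -zkAddress "$ZK"
pinot-admin.sh StartServer -zkAddress "$ZK"
```
<br><strong>@quietgolfer: 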
</strong>I'm guessing my latency issue is related to a lack of disk.  The 
ingestion job still reported success even though I ran into disk issues 
on my pinot-server.<br><strong>@g.kishore: </strong>@quietgolfer can you paste 
the response stats<br><strong>@g.kishore: </strong>ingestion job will succeed 
as long as the data gets uploaded via controller api and stored in deep 
store<br><strong>@g.kishore: </strong>servers can pick it up any 
time<br><strong>@quietgolfer: </strong>Interesting.  Is there a way to force 
the servers to pick it up again after it failed to process internally?  I just 
increased disk and tried again and it worked.<br><strong>@g.kishore: 
</strong>yes, that's the way it's supposed to work<br><strong>@g.kishore: 
</strong>restart will work<br><strong>@g.kishore: </strong>or a reset command 
for the segment in ERROR state
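(A sketch of what that looks like against the controller REST API, assuming a 
controller at localhost:9000; the exact endpoint paths vary across Pinot 
versions, so treat them as assumptions:)
```
# Sketch, assuming a controller at localhost:9000 and the offline table above;
# endpoint paths may differ between Pinot versions.
# Reload all segments of the table:
curl -X POST "localhost:9000/segments/metrics_OFFLINE/reload"
# Newer builds also expose a per-segment reset that re-runs the state
# transitions for a segment stuck in ERROR ("mySegment" is a placeholder):
curl -X POST "localhost:9000/segments/metrics_OFFLINE/mySegment/reset"
```
<br><strong>@quietgolfer: </strong>Cool, 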
ty<br><strong>@g.kishore: </strong>so the latency was also related to 
this?<br><strong>@cinto: </strong>@cinto has joined the 
channel<br><strong>@cinto: </strong>Hi Team,
I just installed Pinot locally and I ran the script
```./bin/quick-start-batch.sh```
It is throwing an error:
```***** Offline quickstart setup complete *****
Total number of documents in the table
Query : select count(*) from baseballStats limit 0
Executing command: PostQuery -brokerHost 127.0.0.1 -brokerPort 8000 -queryType pql -query select count(*) from baseballStats limit 0
Exception in thread "main" java.lang.NullPointerException
        at org.apache.pinot.tools.Quickstart.prettyPrintResponse(Quickstart.java:75)
        at org.apache.pinot.tools.Quickstart.execute(Quickstart.java:174)
        at org.apache.pinot.tools.Quickstart.main(Quickstart.java:207)```
This is what the UI looks like.<br><strong>@aruncthomas: </strong>@aruncthomas 
has joined the channel<br><strong>@quietgolfer: </strong>I'm hitting a slow 
query case where my combined offline/realtime table `metrics` is slow to query 
but the individual `metrics_OFFLINE` and `metrics_REALTIME` are quick to query 
separately.  Any ideas?

```select utc_date, sum(impressions) from metrics_OFFLINE where utc_date >= 
1591142400000 and utc_date < 1593648000000 group by utc_date order by 
utc_date ASC limit 1831```
This returns pretty fast (200ms) over 400 million rows.

If I switch to `metrics_REALTIME`, it's also fast and returns zero rows.
```select utc_date, sum(impressions) from metrics_REALTIME where utc_date >= 
1591142400000 and utc_date < 1593648000000 group by utc_date order by 
utc_date ASC limit 1831```
However, if I query `metrics`, it's very slow.
```select utc_date, sum(impressions) from metrics where utc_date >= 
1591142400000 and utc_date < 1593648000000 group by utc_date order by 
utc_date ASC limit 1831```
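(One way to make the comparison concrete is to time the same query against all 
three views of the table from the broker endpoint; a sketch, assuming a broker 
at localhost:8099:)
```
# Sketch, assuming a broker at localhost:8099: run the identical aggregation
# against the offline, realtime, and hybrid views and compare latencies.
for t in metrics_OFFLINE metrics_REALTIME metrics; do
  echo "table=$t"
  time curl -s -X POST -H 'Content-Type: application/json' \
    -d "{\"sql\": \"select utc_date, sum(impressions) from $t where utc_date >= 1591142400000 and utc_date < 1593648000000 group by utc_date order by utc_date limit 1831\"}" \
    localhost:8099/query/sql > /dev/null
done
```
<br><strong>@pradeepgv42: </strong>Hi, `select * from 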
<table> order by <column> limit 10` is timing out.
I have ~40M rows and ~44 columns, with data spread across two machines.
```
{
    "exceptions": [],
    "numServersQueried": 2,
    "numServersResponded": 0,
    "numSegmentsQueried": 0,
    "numSegmentsProcessed": 0,
    "numSegmentsMatched": 0,
    "numConsumingSegmentsQueried": 0,
    "numDocsScanned": 0,
    "numEntriesScannedInFilter": 0,
    "numEntriesScannedPostFilter": 0,
    "numGroupsLimitReached": false,
    "totalDocs": 0,
    "timeUsedMs": 9999,
    "segmentStatistics": [],
    "traceInfo": {},
    "minConsumingFreshnessTimeMs": 0
}```
There are ~34 segments, and all of them seem to be in either “ONLINE” or 
“CONSUMING” state.
I just see a timeout exception in one of the server logs:
```
Caught TimeoutException. (brokerRequest = BrokerRequest(querySource:QuerySource(tableName:searchtable_REALTIME), selections:Selection(selectionColumns:[*], selectionSortSequence:[SelectionSort(column:timestampMillis, isAsc:true)], size:10), enableTrace:true, queryOptions:{responseFormat=sql, groupByMode=sql, timeoutMs=10000}, pinotQuery:PinotQuery(dataSource:DataSource(tableName:searchtable), selectList:[Expression(type:IDENTIFIER, identifier:Identifier(name:*))], orderByList:[Expression(type:FUNCTION, functionCall:Function(operator:ASC, operands:[Expression(type:IDENTIFIER, identifier:Identifier(name:timestampMillis))]))], limit:10), orderBy:[SelectionSort(column:timestampMillis, isAsc:true)], limit:10))
java.util.concurrent.TimeoutException: null
        at java.util.concurrent.FutureTask.get(FutureTask.java:205) ~[?:1.8.0_252]
        at org.apache.pinot.core.operator.CombineOperator.getNextBlock(CombineOperator.java:169) ~[pinot-all-0.4.0-jar-with-dependencies.jar:0.4.0-8355d2e0e489a8d127f2e32793671fba505628a8]
        at org.apache.pinot.core.operator.CombineOperator.getNextBlock(CombineOperator.java:47) ~[pinot-all-0.4.0-jar-with-dependencies.jar:0.4.0-8355d2e0e489a8d127f2e32793671fba505628a8]
        at org.apache.pinot.core.operator.BaseOperator.nextBlock(BaseOperator.java:42) ~[pinot-all-0.4.0-jar-with-dependencies.jar:0.4.0-8355d2e0e489a8d127f2e32793671fba505628a8]```
Wondering if there is a way to improve the query latency? (Tried with a small 
subset of columns; the query returns results.)<br><strong>@g.kishore: </strong>is 
the timestampMillis column dictionary encoded?<br><strong>@g.kishore: </strong>can 
you make it noDictionaryColumns<br><strong>@pradeepgv42: </strong>got it, 
thanks let me try that<br><strong>@g.kishore: </strong>there is an optimization 
that we can do specifically for time column sorting<br><strong>@g.kishore: 
</strong>I remember Uber folks also suggesting this<br>
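(A sketch of how that change could be applied, assuming a controller at 
localhost:9000 and the "searchtable" realtime table from the stack trace; 
treat the endpoint details as assumptions:)
```
# Sketch, assuming a controller at localhost:9000 and the realtime table
# "searchtable" from the stack trace above.
# 1. Fetch the current table config (the GET response may wrap the config
#    under a "REALTIME" key; unwrap it before pushing back):
curl -s localhost:9000/tables/searchtable > searchtable.json
# 2. Edit tableIndexConfig to add: "noDictionaryColumns": ["timestampMillis"]
# 3. Push the updated config back and reload segments so it takes effect:
curl -s -X PUT -H 'Content-Type: application/json' \
  -d @searchtable.json localhost:9000/tables/searchtable
curl -s -X POST localhost:9000/segments/searchtable_REALTIME/reload
```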
