<h3><u>#general</u></h3><br><strong>@ejankowski_pinot:
</strong>@ejankowski_pinot has joined the channel<br><strong>@damianoporta:
</strong>Hello, I am going to pay for two
<https://u17000708.ct.sendgrid.net/ls/click?upn=iSrCRfgZvz-2BV64a3Rv7HYZM7UL-2FpxzX1gS-2B3SVCw6jyWkV-2BUJdZFGULHJiOo7f-2BX9m-3_vGLQYiKGfBLXsUt3KGBrxeq6BCTMpPOLROqAvDqBeTxCEj3oL4zhLSDAS-2Bozt9-2BVoDDDzJLvxZJWblVN1BZqbHahtnt1GjlwoFpFUnklDSkwfVZIGGkRMP2LUIbhzXySiyQwoM2vSF-2FrLTDKVLJrLTpztu-2BFePMB07X9-2FhIY-2F7y94NkYVFpuHDu1cc0gW-2FaDrawhr-2FUh4a8x5C6ZS7yppeTV9yX5VoG96VN0DD-2FJVVU-3D>
cloud servers to set up the cluster as explained in the doc (thanks
@npawar). I have a doubt: they offer vCPUs, like most cloud services. Would
that be good enough? Has anyone ever used their service?<br><strong>@aruncthomas:
</strong>@aruncthomas has joined the channel<br><strong>@pyne.suvodeep:
</strong>@pyne.suvodeep has joined the
channel<br><h3><u>#random</u></h3><br><strong>@ejankowski_pinot:
</strong>@ejankowski_pinot has joined the channel<br><strong>@aruncthomas:
</strong>@aruncthomas has joined the channel<br><strong>@pyne.suvodeep:
</strong>@pyne.suvodeep has joined the
channel<br><h3><u>#troubleshooting</u></h3><br><strong>@somanshu.jindal:
</strong>@somanshu.jindal has joined the channel<br><strong>@quietgolfer:
</strong>I'm having issues with slow queries. I recently started moving away
from the built-in time column to my own utc_date (timestamps floored to the
UTC date). Now my queries are taking 5 seconds over 80 mil rows (a lot slower
than before). I removed some sensitive parts from the config below.
```metrics_offline_table_config.json: |-
  {
    "tableName": "metrics",
    "tableType": "OFFLINE",
    "segmentsConfig": {
      "schemaName": "metrics",
      "timeColumnName": "timestamp",
      "timeType": "MILLISECONDS",
      "retentionTimeUnit": "DAYS",
      "retentionTimeValue": "1461",
      "segmentPushType": "APPEND",
      "segmentAssignmentStrategy": "BalanceNumSegmentAssignmentStrategy",
      "replication": "1"
    },
    "tableIndexConfig": {
      "loadMode": "MMAP",
      "noDictionaryColumns": ["impressions"],
      "starTreeIndexConfigs": [
        {
          "dimensionsSplitOrder": [
            "utc_date",
            "platform_id",
            "account_id",
            "campaign_id"
          ],
          "skipStarNodeCreationForDimensions": [],
          "functionColumnPairs": [
            "SUM__impressions"
          ]
        }
      ]
    },
    "tenants": {},
    "metadata": {
      "customConfigs": {}
    }
  }```
The query I'm running looks pretty basic. It's asking for aggregate stats at a
high level. In my data, there are 8 unique utc_dates and 1 unique platform.
```select utc_date, sum(impressions) from metrics where platform_id = 13 group
by utc_date```
Recent changes:
• switched from timestamp to my own utc_date (long)
• added `"noDictionaryColumns": ["impressions"],`
These queries previously took 50-100ms.
I'm going to bed now. No need to rush an answer.
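<br>For context on the star-tree config above: in general, a star-tree can serve an
aggregation only when every filter and group-by column appears in
`dimensionsSplitOrder` and every aggregation matches an entry in
`functionColumnPairs`. Under that assumption, queries shaped like these sketches
(names taken from the config above) should be star-tree-eligible:
```-- filter/group-by columns are in dimensionsSplitOrder, and
-- SUM(impressions) matches the SUM__impressions function-column pair
select utc_date, sum(impressions) from metrics where platform_id = 13 group by utc_date
select platform_id, account_id, sum(impressions) from metrics group by platform_id, account_id```<br><strong>@damianoporta: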
</strong>Hello everybody! I need support to set up a cluster. I followed the
instructions explained by @npawar in her video. Everything works as expected,
but I ran that test locally.
Now I should organize all the components in a real cluster of 2-3 servers, and
I need to understand how to arrange them for a high-availability architecture.
Obviously, I am talking about a very small cluster, so take "high-availability"
with a grain of salt :)
My doubt is about how to distribute the components over the servers.
For example, in the video there is just one ZooKeeper instance, started with
`pinot-admin.sh StartZookeeper -zkPort 2181`, so the first question is: what
happens if the server running ZooKeeper goes down? Can we run two or more
ZooKeeper instances across multiple servers? And supposing we can, should every
machine also have its own Controller, Broker, and Server components? Having
more than one broker/controller on the same machine does not make much sense
to me, except maybe for very high traffic. Could someone explain it a little
bit more? Thanks.
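<br>A common pattern for the high-availability question above is to replace the
embedded single ZooKeeper with a replicated three-node ensemble, and to run one
Controller, Broker, and Server per machine so that any single machine can fail.
A minimal sketch, assuming three hosts named host1/host2/host3 (hypothetical
names, not from the thread):
```# zoo.cfg, identical on all three hosts (standard ZooKeeper ensemble config);
# each host also needs its own id written to dataDir/myid
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181
server.1=host1:2888:3888
server.2=host2:2888:3888
server.3=host3:2888:3888

# then, on each host, point every Pinot component at the full ensemble:
ZK=host1:2181,host2:2181,host3:2181
bin/pinot-admin.sh StartController -zkAddress $ZK -controllerPort 9000 &
bin/pinot-admin.sh StartBroker -zkAddress $ZK &
bin/pinot-admin.sh StartServer -zkAddress $ZK &```<br><strong>@quietgolfer: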
</strong>I'm guessing my latency issue is related to a lack of disk. The
ingestion job still reported success even though I ran into disk issues
on my pinot-server.<br><strong>@g.kishore: </strong>@quietgolfer can you paste
the response stats<br><strong>@g.kishore: </strong>ingestion job will succeed
as long as the data gets uploaded via controller api and stored in deep
store<br><strong>@g.kishore: </strong>servers can pick it up any
time<br><strong>@quietgolfer: </strong>Interesting. Is there a way to force
the servers to pick it up again after it failed to process internally? I just
increased disk and tried again and it worked.<br><strong>@g.kishore:
</strong>yes, that's the way it's supposed to work<br><strong>@g.kishore:
</strong>restart will work<br><strong>@g.kishore: </strong>or a reset command
for the segment in ERROR state
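<br>For reference, a sketch of that reset call, assuming a controller build that
exposes the per-segment reset endpoint (host, port, table, and segment name are
placeholders):
```# hypothetical names; re-triggers the state transition so a replica stuck in
# ERROR state re-downloads and reloads the segment
curl -X POST "http://localhost:9000/segments/metrics_OFFLINE/metrics_OFFLINE_0/reset"```<br><strong>@quietgolfer: </strong>Cool,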
ty<br><strong>@g.kishore: </strong>so the latency was also related to
this?<br><strong>@cinto: </strong>@cinto has joined the
channel<br><strong>@cinto: </strong>Hi Team,
I just installed Pinot locally and I ran the script
```./bin/quick-start-batch.sh```
It is throwing an error:
```***** Offline quickstart setup complete *****
Total number of documents in the table
Query : select count(*) from baseballStats limit 0
Executing command: PostQuery -brokerHost 127.0.0.1 -brokerPort 8000 -queryType pql -query select count(*) from baseballStats limit 0
Exception in thread "main" java.lang.NullPointerException
    at org.apache.pinot.tools.Quickstart.prettyPrintResponse(Quickstart.java:75)
    at org.apache.pinot.tools.Quickstart.execute(Quickstart.java:174)
    at org.apache.pinot.tools.Quickstart.main(Quickstart.java:207)```
This is what the UI looks like<br><strong>@aruncthomas: </strong>@aruncthomas
has joined the channel<br><strong>@quietgolfer: </strong>I'm hitting a slow
query case where my combined offline/realtime table `metrics` is slow to query
but the individual `metrics_OFFLINE` and `metrics_REALTIME` are quick to query
separately. Any ideas?
```select utc_date, sum(impressions) from metrics_OFFLINE where utc_date >=
1591142400000 and utc_date < 1593648000000 group by utc_date order by
utc_date ASC limit 1831```
This returns pretty fast (200ms) over 400 mil rows.
If I switch to `metrics_REALTIME`, it's also fast and returns zero rows.
```select utc_date, sum(impressions) from metrics_REALTIME where utc_date >=
1591142400000 and utc_date < 1593648000000 group by utc_date order by
utc_date ASC limit 1831```
However, if I query `metrics`, it's very slow.
```select utc_date, sum(impressions) from metrics where utc_date >=
1591142400000 and utc_date < 1593648000000 group by utc_date order by
utc_date ASC limit 1831```
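<br>One detail worth noting for the hybrid case above: when querying `metrics`, the
broker fans the query out to both physical tables and splits them at a time
boundary derived from the latest offline segment end time, applied to the
table's time column (`timestamp` here). Roughly (a sketch; `T` stands for that
boundary, not a literal value):
```-- approximately what the broker issues for the hybrid table:
select utc_date, sum(impressions) from metrics_OFFLINE where utc_date >= 1591142400000 and utc_date < 1593648000000 and timestamp <= T group by utc_date
select utc_date, sum(impressions) from metrics_REALTIME where utc_date >= 1591142400000 and utc_date < 1593648000000 and timestamp > T group by utc_date```<br><strong>@pradeepgv42: </strong>Hi, “select * from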
<table> order by <column> limit 10” is timing out.
I have ~40M rows and ~44 columns, with data spread across two machines:
```
{
"exceptions": [],
"numServersQueried": 2,
"numServersResponded": 0,
"numSegmentsQueried": 0,
"numSegmentsProcessed": 0,
"numSegmentsMatched": 0,
"numConsumingSegmentsQueried": 0,
"numDocsScanned": 0,
"numEntriesScannedInFilter": 0,
"numEntriesScannedPostFilter": 0,
"numGroupsLimitReached": false,
"totalDocs": 0,
"timeUsedMs": 9999,
"segmentStatistics": [],
"traceInfo": {},
"minConsumingFreshnessTimeMs": 0
}```
Close to ~34 segments, and all of them seem to be in either “ONLINE” or
“CONSUMING” state.
I just see a timeout exception in one of the server logs:
```Caught TimeoutException. (brokerRequest =
BrokerRequest(querySource:QuerySource(tableName:searchtable_REALTIME),
selections:Selection(selectionColumns:[*],
selectionSortSequence:[SelectionSort(column:timestampMillis, isAsc:true)], size:10),
enableTrace:true, queryOptions:{responseFormat=sql, groupByMode=sql, timeoutMs=10000},
pinotQuery:PinotQuery(dataSource:DataSource(tableName:searchtable),
selectList:[Expression(type:IDENTIFIER, identifier:Identifier(name:*))],
orderByList:[Expression(type:FUNCTION, functionCall:Function(operator:ASC,
operands:[Expression(type:IDENTIFIER, identifier:Identifier(name:timestampMillis))]))],
limit:10), orderBy:[SelectionSort(column:timestampMillis, isAsc:true)], limit:10))
java.util.concurrent.TimeoutException: null
    at java.util.concurrent.FutureTask.get(FutureTask.java:205) ~[?:1.8.0_252]
    at org.apache.pinot.core.operator.CombineOperator.getNextBlock(CombineOperator.java:169) ~[pinot-all-0.4.0-jar-with-dependencies.jar:0.4.0-8355d2e0e489a8d127f2e32793671fba505628a8]
    at org.apache.pinot.core.operator.CombineOperator.getNextBlock(CombineOperator.java:47) ~[pinot-all-0.4.0-jar-with-dependencies.jar:0.4.0-8355d2e0e489a8d127f2e32793671fba505628a8]
    at org.apache.pinot.core.operator.BaseOperator.nextBlock(BaseOperator.java:42) ~[pinot-all-0.4.0-jar-with-dependencies.jar:0.4.0-8355d2e0e489a8d127f2e32793671fba505628a8]```
Wondering if there is a way to improve the query latency? (tried with a small
subset of columns, the query returns results)<br><strong>@g.kishore: </strong>is
the timestampMillis column dictionary encoded?<br><strong>@g.kishore: </strong>can
you add it to noDictionaryColumns?<br><strong>@pradeepgv42: </strong>got it,
thanks, let me try that
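<br>For reference, that setting lives under `tableIndexConfig`, as in the metrics
table config earlier in this export; a minimal sketch, assuming the column name
from the stack trace above:
```"tableIndexConfig": {
  "loadMode": "MMAP",
  "noDictionaryColumns": ["timestampMillis"]
}```<br><strong>@g.kishore: </strong>there is an optimization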
that we can do specifically for time column sorting<br><strong>@g.kishore:
</strong>I remember Uber folks also suggesting this<br>