#general
@pabraham.usa: anyone got a performance comparison of storing the index on HDD / SSD / EFS / NFS? As usual I assume SSD will be the top performer?
@mayanks: Two observations I have seen. If the data fits in memory, HDD and SSD performance is similar in terms of query latency. When data starts spilling out of memory, SSD can make a lot of difference. For one use case we were able to push it to full read-throughput capacity and still get ms latency.
@mayanks: As expected of course, but good to establish that in practice as well
@mayanks: We typically use NFS as a deep store (which is not in the read path), and not as an attached disk for serving nodes
@ssubrama: @pabraham.usa one more to add to Mayank's observation. If you have HDD, be prepared to take a latency hit when new segments are loaded. Even if segments are replaced _in situ_ and both of them fit in memory, we have seen high latency during the replacement time.
@ssubrama: Of course, it also depends on how much data you have, the kind of queries you run, etc.
@mayanks: Yep, HDDs have a cold start problem, especially when all data is being refreshed
#troubleshooting
@noahprince8:
@g.kishore: yes
@noahprince8: So how does one achieve autoscaling if you have to manually tag new servers?
@ken: I was running Pinot locally, and wanted to reload some revised data. So I first deleted all segments for the target table via `curl -v "
@mayanks: Hey @ken what do you see in the external view and ideal state?
@ken: Via Zookeeper inspection, or some other way of looking at that data? Sorry, just started working with Pinot…
@mayanks: Yes ZK inspector
@ken: for external view - brokerResource, table_OFFLINE, or leadControllerResource?
@ken: for crawldata_OFFLINE, the mapFields{} has every segment state as “ERROR”
@ken: I can dump JSON here if that’s what you’d prefer, let me know - thanks!
@ken: Sadly, I’ve restarted the services so I’m worried that the state I’m seeing now isn’t all that helpful to you. What’s the best way to reset everything to fresh, and then try to recreate my problem?
@ken: (this is for running Pinot locally, starting up each service via Bash)
@mayanks: You can nuke all the JVMs if this is local and you want to start with a clean state
@mayanks: If your data size is large, perhaps the server is OOM’ing?
@mayanks: Segments go in error state when server cannot host them
@mayanks: Server log usually tells why
@ken: Hmm, if I do that and relaunch Zookeeper, I still see my previous table data in the inspector
@ken: Should I also delete the “PinotCluster” node in ZK?
@mayanks: Yes
@ken: So if OOM is the issue, bumping up server JVM (not broker/controller) is what I’d need to do, right?
@mayanks: Correct. What’s your total data size? If you load segments MMAPed (table config) the OOM shouldn’t happen
@ken: Table config says `"loadMode": "MMAP"`
@mayanks: Ok
@ken: Total segments *.tar.gz file sizes == 45MB, I had 2GB for server, bumped to 4GB
@ken: I restarted everything (ZK first, deleted the PinotCluster node, then everything else) and reimported. Seems to be OK now. I could try deleting all segments and re-importing again, just to see if that puts me into a weird state.
@ken: Though I’m happy now trying to craft queries against real data :slightly_smiling_face:
@mayanks: :+1:
@mayanks: Although, with MMAP, loading of segments should not OOM. Query processing happens on heap though
@ken: And if I check the logs, I see various errors logged by the server (and ZooKeeper). So is there a different approach I should have used to get rid of all of the old data before doing the re-import? This is all for offline data.
#time-based-segment-pruner
@snlee: let's get consensus on the time-based pruner approach. When I talked with @g.kishore and @noahprince8, they had concerns with the O(n) naive approach, so @jiapengtao0 did some research and has an implementation using an interval tree. So, I see the following options: 1. the naive approach 2. keep 2 sorted lists (one by start time, the other by end time), use binary search, and intersect the results 3. an interval tree
@snlee: @jackie.jxt had some concerns about the memory usage of the interval tree and he recommends starting with the naive approach
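For reference, a minimal sketch of the overlap check underlying all three options (illustrative Java, not the actual Pinot pruner code): a segment is kept when its time interval intersects the query's time range; options 2 and 3 only differ in how quickly the candidate segments are found.
```
// Illustrative sketch of option 1, the naive O(n) scan (not Pinot's actual classes).
// A segment interval [start, end] overlaps the query window [queryStart, queryEnd]
// iff start <= queryEnd && end >= queryStart. Options 2 and 3 compute the same set,
// just with fewer comparisons.
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class NaiveTimePruner {
  // segmentName -> {startMillis, endMillis}, taken from segment ZK metadata
  private final Map<String, long[]> _segmentIntervals;

  NaiveTimePruner(Map<String, long[]> segmentIntervals) {
    _segmentIntervals = segmentIntervals;
  }

  List<String> prune(long queryStart, long queryEnd) {
    List<String> selected = new ArrayList<>();
    for (Map.Entry<String, long[]> entry : _segmentIntervals.entrySet()) {
      long[] interval = entry.getValue();
      if (interval[0] <= queryEnd && interval[1] >= queryStart) {
        selected.add(entry.getKey());
      }
    }
    return selected;
  }
}
```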
@snlee: @steotia
@steotia: @steotia has joined the channel
@mayanks: Can we make the decision data driven?
@snlee: @jiapengtao0 can u point to the experiments that u did?
@g.kishore: Why is interval tree memory intensive?
@jackie.jxt: We need to use segment tree instead of interval tree
@jiatao: One second, let me rerun the simple benchmark.
@jackie.jxt: We should not get into a state where we need to manage millions of (logical) segments in one table
@jackie.jxt: Everything else will break too, not just this pruner
@jiatao: Why segment tree instead of interval tree?
@jackie.jxt: Interval tree is basically the same as sorted list + binary search
@jackie.jxt: You need to maintain 2 of them to handle the search for both start and end
@jiatao: It's an augmented binary search tree, and the key is the time interval, so we don't need 2 of them.
@jackie.jxt: When you sort the intervals, you can only sort on one of start or end first
@jackie.jxt: If you sort on start first, you cannot handle the search for the query start, because the ends are not sorted in the tree
@steotia: isn't the problem about finding segments with overlapping intervals? So given each interval from the filter tree, look for overlapping intervals and the corresponding segments. An interval tree sounds like a candidate
@steotia: a segment tree is better when you want to do point queries within an interval
@steotia: segment tree is not your traditional balanced BST physically.. it is an array
@steotia: so not sure how we can use it to find overlapping intervals
@jiatao:
@jiatao: I think there are some interval-tree variants with quite similar names, so that may cause confusion.
@jackie.jxt: Let me learn about it, based on the wiki, it can handle our requirement:
@steotia: both segment and interval tree store intervals... I think the difference is what you want to query... in this case we want to query overlapping intervals.. which segment tree can't answer
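To make the comparison concrete, here is a minimal, illustrative interval tree: an augmented BST keyed on interval start, where each node also stores the max end time in its subtree so whole subtrees can be skipped during the overlap search. This is only a sketch of the idea being discussed, not the implementation @jiatao benchmarked.
```
import java.util.ArrayList;
import java.util.List;

// Illustrative interval tree: BST keyed on interval start, augmented with the
// maximum end time in each subtree so non-overlapping subtrees can be pruned.
class IntervalTree {
  private static class Node {
    final long start, end;  // segment time interval
    final String segment;   // segment name
    long maxEnd;            // max end time in this subtree
    Node left, right;

    Node(long start, long end, String segment) {
      this.start = start;
      this.end = end;
      this.segment = segment;
      this.maxEnd = end;
    }
  }

  private Node _root;

  // Unbalanced insert, enough to illustrate the augmentation.
  void insert(long start, long end, String segment) {
    _root = insert(_root, start, end, segment);
  }

  private Node insert(Node node, long start, long end, String segment) {
    if (node == null) {
      return new Node(start, end, segment);
    }
    if (start < node.start) {
      node.left = insert(node.left, start, end, segment);
    } else {
      node.right = insert(node.right, start, end, segment);
    }
    node.maxEnd = Math.max(node.maxEnd, end);
    return node;
  }

  // Collect all segments whose interval overlaps [queryStart, queryEnd].
  List<String> findOverlapping(long queryStart, long queryEnd) {
    List<String> result = new ArrayList<>();
    search(_root, queryStart, queryEnd, result);
    return result;
  }

  private void search(Node node, long queryStart, long queryEnd, List<String> result) {
    if (node == null || node.maxEnd < queryStart) {
      return; // nothing in this subtree ends late enough to overlap the query
    }
    search(node.left, queryStart, queryEnd, result);
    if (node.start <= queryEnd && node.end >= queryStart) {
      result.add(node.segment);
    }
    if (node.start <= queryEnd) {
      search(node.right, queryStart, queryEnd, result); // right subtree starts may still overlap
    }
  }
}
```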
@jackie.jxt: Before this, I want to discuss if we want to handle the corner case of ZK failures
@jackie.jxt: Currently, if some segment ZK metadata is missing due to temporary ZK failures, we still route the segments. But that is not possible with this approach
@jiatao: Why?
@jiatao: If the ZK metadata is missing, we can set the interval as default [0, LONG.MAX_VALUE]
@jiatao: So it won't get filtered out.
@jackie.jxt: If the ZK metadata is missing, you won't know the existence of the segment, thus not able to put this default
@jackie.jxt: Oh, I take it back, you initialize the ZK metadata based on the online segments
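A small sketch of the fallback described above (types and names are hypothetical, not Pinot's actual metadata classes): a segment whose ZK time metadata could not be read gets the widest possible interval, so the time pruner never filters it out.
```
// Hypothetical fallback: missing ZK time metadata => widest interval, never pruned.
class SegmentTimeInfo {
  final Long startMs; // null if the segment ZK metadata could not be read
  final Long endMs;

  SegmentTimeInfo(Long startMs, Long endMs) {
    this.startMs = startMs;
    this.endMs = endMs;
  }
}

class TimePrunerDefaults {
  static long[] intervalFor(SegmentTimeInfo info) {
    if (info == null || info.startMs == null || info.endMs == null) {
      // Missing metadata: default to [0, Long.MAX_VALUE] so the segment is always routed.
      return new long[]{0L, Long.MAX_VALUE};
    }
    return new long[]{info.startMs, info.endMs};
  }
}
```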
@jackie.jxt: After reading the wiki, now I understand the algorithm. We definitely need more comments explaining the algorithm, or a pointer to the wiki
@jackie.jxt: Do you have the perf number comparing this approach with the naive one?
@jiatao: Yeah, I'm running it to print the result, maybe 3 more minutes
@noahprince8: > We should not get into a state where we need to manage millions of (logical) segments in one table
What else would break down at millions of segments?
@jackie.jxt: ZK, cluster management, segment size check etc
@jackie.jxt: I think the solution should be grouping multiple physical segments into one logical segment, and only storing ZK metadata for the logical segment, as @g.kishore suggested
@noahprince8: Yeah. May also be trying to make Pinot something it's not. Pinot is very good at analyzing terabytes of time series data. Might be a square peg in a round hole trying to make it support petabytes. Perhaps traditional parquet files in s3 are better for 3+ month old data. Or something like that. I still think the segment cold storage and the time-based pruners are nice features to have just for the cost savings. Probably most of our tables can fit into that paradigm.
@jackie.jxt: My concern with the cold storage feature is that one query that hits the old data could invalidate everything in the server cache. Just downloading all the old segments could take hours, and I don't think Pinot is designed for queries with such high latency.
@jackie.jxt: Ideally, we should abstract the segment SPI so that the query engine can work directly on parquet files stored in the deep storage without downloading the whole files
@noahprince8: Yeah, I think 1) having limits on the number of segments you can query and 2) pruning aggressively helps.
@noahprince8: I agree, the ideal would be having it work directly on parquet files. Though at what point are you better off just using Presto with a UNION ALL query with a `WHERE date < 3 months` on one side and a `WHERE date > 3 months` on the other side?
@g.kishore: This is not really terabytes vs petabytes, it's more about metadata scale
@g.kishore: And specifically for one table
@g.kishore: If you use Presto, a simple file listing will have the same problem
@g.kishore: In the end, it’s a file list + filtering
@noahprince8: I think it's both, really. The large amount of data itself becomes a problem because you can't store it on normal hard disks. It needs to be in a distributed FS. With that much data, if your segments are 200-500 MB, the metadata scale also becomes a problem.
@g.kishore: We talked about this rt
@g.kishore: With a segment per day
@g.kishore: We are really looking at 3600 segments for 10 years
@noahprince8: So you can either do the segment group, or just merge the segments. But either way you do have a long download time.
@noahprince8: Now, if you’re lucky the segment group itself causes a hit to a separate metadata API which narrows down the number of smaller segments you need to download.
@g.kishore: Yeah, let's get there to know what the bottleneck is
@noahprince8: But again I keep asking myself if this isn't trying to force Pinot into something it's not. Are you better off having a bespoke solution for long-term historical data that involves an aggressively indexed metadata store of files in s3, and a query engine that reads directly from s3 (like most parquet implementations)?
@jackie.jxt: Broker will prune based on logical segments, and server can map logical segments into physical segments
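Roughly, the split being proposed could look like the following (hypothetical names, only meant to illustrate the broker/server division of labor: the broker prunes on coarse logical segments, each with one ZK metadata entry, and the server expands a logical segment into the physical segments it hosts).
```
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical server-side mapping from logical segments to physical segments.
class LogicalSegmentResolver {
  // logical segment name -> physical segment names hosted on this server
  private final Map<String, List<String>> _logicalToPhysical;

  LogicalSegmentResolver(Map<String, List<String>> logicalToPhysical) {
    _logicalToPhysical = logicalToPhysical;
  }

  // Broker routes a pruned logical segment; the server expands it before execution.
  List<String> resolve(String logicalSegment) {
    return _logicalToPhysical.getOrDefault(logicalSegment, Collections.emptyList());
  }
}
```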
@jiatao: The intervalMP is the naive method; the performance ratio here means how much better the search-tree method is compared with the naive one. In the first experiment I fixed the percentage of resulting segments and increased the # of segments/table:
```
100 segments/table:      intervalMP -> 0.0374 ms, intervalST -> 0.00539 ms, 6.9 performance ratio,  average segments/table: 100.0,    average result size: 8.1,      result size percentage: 8.1
1000 segments/table:     intervalMP -> 0.0767 ms, intervalST -> 0.0093 ms,  8.23 performance ratio, average segments/table: 1000.0,   average result size: 52.0,     result size percentage: 5.2
10000 segments/table:    intervalMP -> 0.358 ms,  intervalST -> 0.056 ms,   6.3 performance ratio,  average segments/table: 10000.0,  average result size: 501.3,    result size percentage: 5
100000 segments/table:   intervalMP -> 6.16 ms,   intervalST -> 0.86 ms,    7.15 performance ratio, average segments/table: 100000.0, average result size: 5001.0,   result size percentage: 5
10000000 segments/table: intervalMP -> 636.1 ms,  intervalST -> 230.5 ms,   2.75 performance ratio, average segments/table: 1.0E7,    average result size: 500001.6, result size percentage: 5
```
@jiatao: In the second experiment, I fixed the # of segments/table and decreased the percentage of resulting segments:
```
10000000 segments/table: intervalMP -> 1237.3 ms, intervalST -> 432.7 ms, 2.86 performance ratio,    average segments/table: 1.0E7, average result size: 5000001.1, result size percentage: 50.0
10000000 segments/table: intervalMP -> 697.3 ms,  intervalST -> 224.6 ms, 3.1 performance ratio,     average segments/table: 1.0E7, average result size: 500001.3,  result size percentage: 5.0
10000000 segments/table: intervalMP -> 593.4 ms,  intervalST -> 18.97 ms, 31.27 performance ratio,   average segments/table: 1.0E7, average result size: 50001.0,   result size percentage: 0.5
10000000 segments/table: intervalMP -> 585.3 ms,  intervalST -> 3.18 ms,  183.97 performance ratio,  average segments/table: 1.0E7, average result size: 5002.1,    result size percentage: 0.05
10000000 segments/table: intervalMP -> 582.2 ms,  intervalST -> 0.48 ms,  1203.15 performance ratio, average segments/table: 1.0E7, average result size: 501.3,     result size percentage: 0.0050
```
@jiatao: If the resulting # of segments is much smaller than the total # of segments, the interval search tree performs much better.
@g.kishore: great to see this level of details :clap::clap:
@jackie.jxt: ```10000000 segments/table: intervalMP -> 1237.3 ms, intervalST -> 432.7 ms, 2.86 performance ratio, average segments/table: 1.0E7, average result size: 5000001.1, result size percentage: 50.0```
Really? Even selecting 50% of the segments, the naive one takes 2.86 times as long as the tree solution?
@jiatao: I used a ConcurrentHashMap for the naive solution
@jiatao: So it may not perform as well as a normal HashMap.
@jackie.jxt: Anyway, I think if we select fewer segments, the tree solution is definitely much faster
@jackie.jxt: Good job
@jackie.jxt: Based on this experiment, in order to handle a large number of segments, we might also want to optimize the partition-based pruner to pre-group the segments under each partition
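For example, the pre-grouping suggested here might look roughly like this (a sketch, under the assumption that each segment's partition id is already available from its metadata; not the actual Pinot pruner code):
```
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: group segments by partition once, so the partition pruner does a single
// map lookup per query instead of scanning every segment's metadata.
class PartitionGroupedSegments {
  private final Map<Integer, List<String>> _partitionToSegments = new HashMap<>();

  // Called once per segment when its metadata is loaded or refreshed.
  void add(int partitionId, String segmentName) {
    _partitionToSegments.computeIfAbsent(partitionId, k -> new ArrayList<>()).add(segmentName);
  }

  // At query time, only the matching partition's segments are considered.
  List<String> segmentsForPartition(int partitionId) {
    return _partitionToSegments.getOrDefault(partitionId, Collections.emptyList());
  }
}
```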
