Apache Pinot Daily Email Digest (2020-10-29)

Pinot Slack Email Digest Thu, 29 Oct 2020 19:01:02 -0700

#general

@xmuyoo: @xmuyoo has joined the channel
@mail460: @mail460 has joined the channel
@g.kishore: Passcode: 541532
@g.kishore: meetup happening now ^^
@azhar93: @azhar93 has joined the channel
@ravibabu.chikkam: is this meeting only for uber employees?
@mayanks: No
@mayanks: You are welcome to join
@ravibabu.chikkam: the link is not working
@ravibabu.chikkam: asking for sign in
@mayanks: Hmm, perhaps it is
@mayanks: Could you try signing in there, and join the event from there (may have to rsvp)
@mayanks:
@sree_at_chess: @sree_at_chess has joined the channel
@sree_at_chess: Hi
@gary.a.stafford: @gary.a.stafford has joined the channel

#random

@xmuyoo: @xmuyoo has joined the channel
@mail460: @mail460 has joined the channel
@azhar93: @azhar93 has joined the channel
@sree_at_chess: @sree_at_chess has joined the channel
@gary.a.stafford: @gary.a.stafford has joined the channel

#feat-presto-connector

@nguyenhoanglam1990: @nguyenhoanglam1990 has joined the channel

#troubleshooting

@elon.azoulay: Another migration question: our kafka cluster will be moving and offsets will be reset. How can we ensure that pinot keeps ingesting, is there a way to do this with no downtime?
@elon.azoulay: reset strategy is "earliest" but is there an api call to tell pinot to reset realtime ingestion?
@ssubrama: If offsets are reset I see no way to recover the realtime table. You will need to drop the realtime table and recreate it. At some point, the consumers will just be waitinfor events and see no more
@ssubrama: Offsets are expected to increase monotonically (not necessarily sequentially)
@elon.azoulay: Got it, this is really good to know!
@mayanks: How many tables do you have in your cluster?
@g.kishore: do you have offline table?
@elon.azoulay: Yes
@elon.azoulay: We have ~15 tables, all hybrid except 2 that are realtime only
@elon.azoulay: This is staging, so we can test it out
@elon.azoulay: I will see if we can set the offsets on the new cluster.
@elon.azoulay: Is it possible to update a realtime table def to change the kafka broker url with no issues?

#time-based-segment-pruner

@snlee: @snlee has joined the channel
@g.kishore: @g.kishore has joined the channel
@jiapengtao0: @jiapengtao0 has joined the channel
@noahprince8: @noahprince8 has joined the channel
@mayanks: @mayanks has joined the channel
@snlee: hello
@snlee: i created this ticket to discuss about the plan on time based segment pruner
@noahprince8: So a bit around our use case for this. We have a little over a petabyte of data. Pinot seems most commonly used with < 100TB. At that scale, you’re looking at millions of segments. And a large EBS volume bill. So part 1 of that is lazily fetching segments. Part 2 is ensuring that the broker prunes segments aggressively to minimize the number of segments lazily fetched in deep historical servers.
@snlee: So part 1& 2 can happen in parallel
@snlee: So to update on our side: • We are working on evaluating 2 interval search algorithms for the pruner (1. O(N) naive for each loop 2. O(m*logN) using interval search tree) • Once the study is done, we will start to implement the broker side pruner • For ETA, I will discuss with @jiapengtao0 and update
@noahprince8: From what I understand, segments are currently stored in Zookeeper. Not sure if zookeeper supports a time index, or can scale well at millions of segments.
@noahprince8: If you’re building a search tree from the segments in zookeeper, presumably you need to have all of the segments in memory
@noahprince8: That’s going to cause issues as segments scale
@snlee: @noahprince8 So, we fetch the segment metadata on `external view change` . This happens only if new segments are added or deleted (there are some other cases but in the happy path…). 1. we periodically fetch segment metadata info and update the info in memory <- this part can be expensive since it happens few times per day 2. we use in-memory info for segment pruning
@snlee: so, pinot’s scalability will bound to the information that we need to store in memory.
@snlee: if you have billions of segments, then we will eventually run out of memory.. but for millions, i think it’s doable.
@noahprince8: If you have a 500 TB table with 50mb segments, that would be 10,000,000 segments. I think that could cause OOMs in the broker holding this search tree. I’m also a little unfamiliar with the way helix’s message passing works. But I imagine state changes might also become an issue at this scale.
@noahprince8: Yeah, I’m curious what the memory footprint is of a million segments. We could also up the segment size to lower the count. Or potentially merge very old segments.
@noahprince8: But I’m wondering if we could instead delegate segment storage to something like influxdb with a time series index. It’s custom-built to hold this type of data in memory as long as it can, and flush to the disc when it needs to
@noahprince8: Or… honestly, what if we just make segment storage pluggable? Because for most use cases the in memory index will do. Then my company could write a custom plugin that offloads into influx for larger tables.
@snlee: I think that iceberge from Netflix is also doing the similar thing
@g.kishore: I dont really see why influxdb will be any better here
@noahprince8: Yeah. They do it with files stored in s3 I believe. Snowflake also does a similar kind of indexing of segments, and they use foundationdb
@noahprince8: @g.kishore influx just as an example, I haven’t actually used it. But generally something that can effectively index/query time indexed data. That uses a hybrid memory caching and disc based approach so we know it will scale with the volume.
@snlee: Pinot is also one type of storage + query engine solutions for index/querying time series data.
@noahprince8: Because Pinot is performing pretty well with a couple days of data. My main concern is that it will fall over when we introduce some really large data sets. And maybe Pinot isn’t the best solution for these datasets. I could be trying to shove a square peg in a round hole here.
@noahprince8: Ha using pinot to index pinot segments for large tables. That would be funny.
@noahprince8: Could even make a tree of pinots if your segment index gets too big so it needs its own segment index :smile:
@snlee: So do you need to serve one giant table? or you will have many of these?
@noahprince8: Many of these
@snlee: I think that we can roll out the feature in step. We will anyway make this pruner configurable per table.
@noahprince8: How are segments added? If you could have a pluggable segment adder and pruner that'd be cool
@snlee: adder?
@snlee: pruner is pluggable yes
@noahprince8: Yeah. I'm a bit unfamiliar with the architecture. But I assume there's an event bus with new segments
@snlee: ohoh
@snlee: so we put the zk watcher from the broker
@noahprince8: So if you could make it completely detachable so that you can abstract segment storage and querying
@snlee: and broker will be notified if there’s any update on the segment for teh table
@snlee: Our deep storage (or segment store) has been abstracted out.
@snlee: but our query execution is closely mingled with mmaped segments placed locally. At the Pinot server level, there’s no store & querying separation.
@snlee: yeah we can abstract out the segments instead of requiring it to place locally. but that will be a huge architectural change. also, i’m not sure if it’s possible to do mmap to a remote file
@noahprince8: Ah. That makes sense
@snlee: we can keep this as a separate discussion. I think that it’s an interesting topic since eventually, separated store & execution will fit better on cloud environment (if it’s possible to achieve for pinot without perf degradation)
@snlee: so, for supporting your use case,
@snlee: 1. I think that you can use larger segment size. 500MB segment size will work pretty well. That will reduce the # of segments by the factor of 10 2. We will first check in the in-memory based time based pruner. If you can verify from your use case, it will be great.
@noahprince8: Yeah, I’ll definitely test it with a few months of a large dataset
@snlee: sounds good
@snlee: we will update here when we have the update on the pr.
@noahprince8: Yeah, so for rough idea of scale, our largest dataset is around 3 TB a day (taken from our kafka input rate on a busy day. This is a compressed binary protocol). Mid size one is around 300 GB a day. Total we’re looking at around 10 TB a day. The 3 TB data set can be made more efficient to make it slightly smaller. But with 252 trading days in a year, that big dataset is going to be 756TB. So 12 million segments. Is it realistic to be trying to store something this large in Pinot? Or should we be using traditional data warehousing techniques like hive, iceberg files, etc.
@mayanks: Would you be able to provide a bit more info on the use case (eg content of data, what you want to query)?
@noahprince8: Yeah, so this is full depth of the books in an options exchange. Blows up pretty quickly because for every exchange, there are multiple underlying contracts. For each one of those, many futures at various expirations. For each future, multiple options contracts at various strike prices. For each one of those, there’s multiple levels of bid price/quantity, ask price/quantity listed from best to worst offer.
@mayanks: Ok. What kind of queries would you run?
@noahprince8: Plot the top of book for this particular options contract over this 2 minute interval Give me the average top of book for some contract over some time interval. Join this to some other dataset, like our theoretical price of an option and compare.
@mayanks: so queries limited to a particular contract, or can span across contracts?
@noahprince8: Most likely limited to a particular contract. Though ultimately I’m not sure how quants will use it. They may query multiple options for the same future. Or multiple options for the same underlying stock.
@noahprince8: Which is why I’m very interested in the bloom filtering. Because a bloom filter on underlying and expiry + an efficient time filter could cut 12 million segments down to 2
@mayanks: For queries limited to particular contract, partitioning on contract would prune all other contracts
@mayanks: Is your 3TB data size translating to 3TB of Pinot segments, or do you see compression?
@mayanks: We have cases with 100MB - 1GB segment size (per segment), so definitely increasing the segment size will help reduce broker memory usage
@noahprince8: I suspect you’d see some compression. I have not put that dataset into pinot, yet.
@mayanks: Ok, would be good to pick one file and see what the compression is
@noahprince8: Yeah. More so I want to prepare for millions of segments so we don’t go all in on pinot and hit scaling issues later.
@noahprince8: So we know something like iceberg files will probably work, but will be slower than pinot. Doing due diligence now to ferret out everywhere pinot might crack at scale. As is typical in data infra though, there’s a limited amount we can tell without actually _testing_ it. Let’s see how it works with the in-memory merge tree. I’ll work on the lazy loading from s3. Then I can spinup a testing cluster in AWS and backfill a few months of data.
@jiatao: @jiatao has joined the channel
@snlee: cool
@noahprince8: Sorry for being annoying with this haha. Will try to make up for it by contributing some code :smile:
@snlee: :+1:
@snlee: imo, if you can get to merge your files to become ~500MB to reduce the # segments to a single digit millions, in-memory pruning should be fine.
@noahprince8: Yeah, I think also we can create different tenants for the super large tables
@noahprince8: Then use presto or that rest thing that uber made to unify them
@noahprince8: We could also introduce another layer of tiering so that after x months, we create huge day-long segments.
--------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]