Re: [Spark SQL] off-heap columnar store
On Sun, Aug 31, 2014 at 8:27 PM, Ian O'Connell i...@ianoconnell.com wrote:

> I'm not sure what you mean here? Parquet is at its core just a format; you could store that data anywhere. Though it sounds like you're saying, correct me if I'm wrong: you basically want a columnar abstraction layer where you can provide a different backing implementation to keep the columns, rather than parquet-mr? I.e., you want to be able to produce a schema RDD from something like Vertica, where updates act as a write-through cache back to Vertica itself?

Something like that. I'd like:

1) An API to produce a schema RDD from an RDD of columns, not rows. However, an RDD[Column] would not make sense, since it would be spread out across partitions. Perhaps what is needed is a Seq[RDD[ColumnSegment]]. The idea is that each RDD would hold the segments for one column, where the segments represent a range of rows. This would then read from something like Vertica or Cassandra. (See the sketch after this message.)

2) A variant of 1) where you could read this data from Tachyon. Tachyon is supposed to support a columnar representation of data; it did for Shark 0.9.x. The goal is basically to load columnar data from something like Cassandra into Tachyon, with the compression ratio of columnar storage and the speed of InMemoryColumnarTableScan. If data is appended to the Tachyon representation, we should be able to cache it back. The write-back is not as high a priority, though.

A workaround would be to read data from Cassandra/Vertica/etc. and write it back into Parquet, but this would take a long time and incur huge I/O overhead.

> I'm sorry, it just sounds like it's worth clearly defining what your key requirement/goal is.

> On Thu, Aug 28, 2014 at 11:31 PM, Evan Chan velvia.git...@gmail.com wrote:
>
> The reason I'm asking about the columnar compressed format is that there are some problems for which Parquet is not practical.
>
> Can you elaborate?
>
> Sure.
>
> - Organization or co has no Hadoop, but significant investment in some other NoSQL store.
> - Need to efficiently add a new column to existing data.
> - Need to mark some existing rows as deleted, or replace small bits of existing data.
>
> For these use cases, it would be much more efficient and practical if we didn't have to take the data out of the original datastore and convert it to Parquet first. Doing so adds significant latency and causes Ops headaches in having to maintain HDFS. It would be great to be able to load data directly into the columnar format, into the InMemoryColumnarCache.
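A rough sketch of what the Seq[RDD[ColumnSegment]] idea in 1) might look like. ColumnSegment, its fields, and the ColumnarTable alias are hypothetical names invented here for illustration; nothing like this exists in Spark SQL today:

    import org.apache.spark.rdd.RDD

    // Hypothetical: one compressed segment of a single column,
    // covering rows [startRow, startRow + numRows).
    case class ColumnSegment(
      columnName: String,
      startRow: Long,
      numRows: Int,
      values: Array[Byte]) // compressed column bytes

    object ColumnarSketch {
      // One RDD per column; each RDD holds that column's segments,
      // e.g. read from Vertica or Cassandra. Segments covering the
      // same row range would be zipped back together to form rows.
      type ColumnarTable = Seq[RDD[ColumnSegment]]
    }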
Re: [Spark SQL] off-heap columnar store
>> The reason I'm asking about the columnar compressed format is that there are some problems for which Parquet is not practical.
>
> Can you elaborate?

Sure.

- Organization or co has no Hadoop, but significant investment in some other NoSQL store.
- Need to efficiently add a new column to existing data.
- Need to mark some existing rows as deleted, or replace small bits of existing data.

For these use cases, it would be much more efficient and practical if we didn't have to take the data out of the original datastore and convert it to Parquet first. Doing so adds significant latency and causes Ops headaches in having to maintain HDFS. It would be great to be able to load data directly into the columnar format, into the InMemoryColumnarCache.
Re: [Spark SQL] off-heap columnar store
What would be the timeline for the parquet caching work? The reason I'm asking about the columnar compressed format is that there are some problems for which Parquet is not practical.

On Mon, Aug 25, 2014 at 1:13 PM, Michael Armbrust mich...@databricks.com wrote:

>> What is the plan for getting Tachyon/off-heap support for the columnar compressed store? It's not in 1.1, is it?
>
> It is not in 1.1, and there are no concrete plans for adding it at this point. Currently, more engineering investment is going into caching parquet data in Tachyon instead. This approach will have much better support for nested data, leverages other work being done on parquet, and alleviates your concerns about wire format compatibility. That said, if someone really wants to try and implement it, I don't think it would be very hard. The primary issue is going to be designing a clean interface that is not too tied to this one implementation.
>
>> Also, how likely is the wire format for the columnar compressed data to change? That would be a problem for write-through or persistence.
>
> We aren't making any guarantees at the moment that it won't change. It's currently only intended for temporary caching of data.
Re: [Spark SQL] off-heap columnar store
> Any initial proposal or design about the caching to Tachyon that you can share so far?

Caching parquet files in Tachyon with saveAsParquetFile and then reading them with parquetFile should already work. You can use SQL on these tables by registering them with registerTempTable. Some of the general parquet work that we have been doing includes: #1935 (https://github.com/apache/spark/pull/1935), SPARK-2721 (https://issues.apache.org/jira/browse/SPARK-2721), SPARK-3036 (https://issues.apache.org/jira/browse/SPARK-3036), SPARK-3037 (https://issues.apache.org/jira/browse/SPARK-3037), and #1819 (https://github.com/apache/spark/pull/1819).

> The reason I'm asking about the columnar compressed format is that there are some problems for which Parquet is not practical.

Can you elaborate?
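A minimal sketch of the path Michael describes, using the existing Spark 1.1 APIs. It assumes a running Tachyon master and the spark-shell's sc; the tachyon:// path, input path, and table name are made up for illustration:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc) // sc: the spark-shell's SparkContext

    // Hypothetical input data; any SchemaRDD works here.
    val events = sqlContext.jsonFile("hdfs://namenode/data/events.json")

    // Write the data as Parquet into Tachyon (path is illustrative).
    events.saveAsParquetFile("tachyon://master:19998/cached/events.parquet")

    // Read it back and register it, so SQL queries run against the
    // Tachyon-cached copy.
    val cached = sqlContext.parquetFile("tachyon://master:19998/cached/events.parquet")
    cached.registerTempTable("events")
    sqlContext.sql("SELECT COUNT(*) FROM events").collect()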
Re: [Spark SQL] off-heap columnar store
Hi Michael,

This is great news. Any initial proposal or design about the caching to Tachyon that you can share so far? I don't think there is a JIRA ticket open to track this feature yet.

- Henry

On Mon, Aug 25, 2014 at 1:13 PM, Michael Armbrust mich...@databricks.com wrote:

>> What is the plan for getting Tachyon/off-heap support for the columnar compressed store? It's not in 1.1, is it?
>
> It is not in 1.1, and there are no concrete plans for adding it at this point. Currently, more engineering investment is going into caching parquet data in Tachyon instead. This approach will have much better support for nested data, leverages other work being done on parquet, and alleviates your concerns about wire format compatibility. That said, if someone really wants to try and implement it, I don't think it would be very hard. The primary issue is going to be designing a clean interface that is not too tied to this one implementation.
>
>> Also, how likely is the wire format for the columnar compressed data to change? That would be a problem for write-through or persistence.
>
> We aren't making any guarantees at the moment that it won't change. It's currently only intended for temporary caching of data.
[Spark SQL] off-heap columnar store
Hey guys,

What is the plan for getting Tachyon/off-heap support for the columnar compressed store? It's not in 1.1, is it? In particular:

- being able to set TACHYON as the caching mode
- loading of hot columns or all columns
- write-through of columnar store data to HDFS or a backing store
- being able to start a context and query directly from Tachyon's cached columnar data

I think most of this was in Shark 0.9.1. (See the sketch after this message for what the first item might look like.)

Also, how likely is the wire format for the columnar compressed data to change? That would be a problem for write-through or persistence.

thanks,
Evan
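For context on the first item: plain RDDs can already be persisted off-heap into Tachyon via StorageLevel.OFF_HEAP, which is roughly the knob being asked for at the SQL layer. A sketch, assuming the spark-shell's sc and an illustrative input path; the SQL-side call at the end is hypothetical:

    import org.apache.spark.storage.StorageLevel

    // Existing RDD-level API: OFF_HEAP stores blocks in Tachyon,
    // but it bypasses Spark SQL's compressed columnar cache.
    val lines = sc.textFile("hdfs://namenode/data/events.tsv")
    lines.persist(StorageLevel.OFF_HEAP)
    lines.count() // materialize the off-heap cache

    // Hypothetical: what the requested SQL-side knob might look like;
    // today sqlContext.cacheTable takes no storage level.
    // sqlContext.cacheTable("events", StorageLevel.OFF_HEAP)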