Re: [Spark SQL] off-heap columnar store

2014-09-02 Thread Evan Chan
On Sun, Aug 31, 2014 at 8:27 PM, Ian O'Connell i...@ianoconnell.com wrote:
 I'm not sure what you mean here? Parquet is at its core just a format;
 you could store that data anywhere.

 Though it sounds like you're saying, correct me if I'm wrong: you basically
 want a columnar abstraction layer where you can provide a different backing
 implementation to keep the columns rather than parquet-mr?

 I.e. you want to be able to produce a schema RDD from something like
 Vertica, where updates should act as a write-through cache back to Vertica
 itself?

Something like that.

I'd like:

1)  An API to produce a SchemaRDD from an RDD of columns, not rows.
However, an RDD[Column] would not make sense, since it would be
spread out across partitions.  Perhaps what is needed is a
Seq[RDD[ColumnSegment]] (rough sketch below).  The idea is that each
RDD would hold the segments for one column, and each segment would
represent a range of rows.  This would then read from something like
Vertica or Cassandra.

2)  A variant of 1) where you could read this data from Tachyon.
Tachyon is supposed to support a columnar representation of data, as
it did for Shark 0.9.x.

The goal is basically to load columnar data from something like
Cassandra into Tachyon, with the compression ratio of columnar
storage and the speed of InMemoryColumnarTableScan.  If data is
appended to the Tachyon representation, it should be possible to
write it back to the source, though write-back is not as high a
priority.
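
To make 1) concrete, here is a rough Scala sketch of what I have in mind.
None of these names (RowRange, ColumnSegment, schemaRDDFromColumns) exist
in Spark today; they are purely hypothetical:

import org.apache.spark.rdd.RDD

// Purely hypothetical sketch -- none of these types exist in Spark SQL today.
object ColumnarSourceSketch {

  // A segment holds the encoded values of one column for a contiguous row range.
  case class RowRange(start: Long, end: Long)

  case class ColumnSegment(
      columnName: String,
      rows: RowRange,
      encodedValues: Array[Byte])  // e.g. RLE- or dictionary-compressed bytes

  // One RDD per column, each holding that column's segments, read from
  // Cassandra, Vertica, Tachyon, etc.
  type ColumnarTable = Seq[RDD[ColumnSegment]]

  // The API I'd like: align the per-column RDDs by row range and expose the
  // result as a SchemaRDD, ideally feeding InMemoryColumnarTableScan without
  // ever materializing row objects.
  // def schemaRDDFromColumns(sqlCtx: org.apache.spark.sql.SQLContext,
  //                          columns: ColumnarTable): org.apache.spark.sql.SchemaRDD = ???
}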

A workaround would be to read data from Cassandra/Vertica/etc. and
write back into Parquet, but this would take a long time and incur
huge I/O overhead.


 I'm sorry, it just sounds like it's worth clearly defining what your key
 requirement/goal is.


 On Thu, Aug 28, 2014 at 11:31 PM, Evan Chan velvia.git...@gmail.com wrote:

 
  The reason I'm asking about the columnar compressed format is that
  there are some problems for which Parquet is not practical.
 
 
  Can you elaborate?

 Sure.

 - Organization or co has no Hadoop, but significant investment in some
 other NoSQL store.
 - Need to efficiently add a new column to existing data
 - Need to mark some existing rows as deleted or replace small bits of
 existing data

 For these use cases, it would be much more efficient and practical if
 we didn't have to take the original data from the datastore and
 convert it to Parquet first.  Doing so costs significant latency and
 causes Ops headaches in having to maintain HDFS.  It would be great
 to be able to load data directly into the columnar format, i.e. into
 the InMemoryColumnarCache.




Re: [Spark SQL] off-heap columnar store

2014-08-29 Thread Evan Chan

 The reason I'm asking about the columnar compressed format is that
 there are some problems for which Parquet is not practical.


 Can you elaborate?

Sure.

- Organization or co has no Hadoop, but significant investment in some
other NoSQL store.
- Need to efficiently add a new column to existing data
- Need to mark some existing rows as deleted or replace small bits of
existing data

For these use cases, it would be much more efficient and practical if
we didn't have to take the original data from the datastore and
convert it to Parquet first.  Doing so costs significant latency and
causes Ops headaches in having to maintain HDFS.  It would be great
to be able to load data directly into the columnar format, i.e. into
the InMemoryColumnarCache.




Re: [Spark SQL] off-heap columnar store

2014-08-26 Thread Evan Chan
What would be the timeline for the parquet caching work?

The reason I'm asking about the columnar compressed format is that
there are some problems for which Parquet is not practical.

On Mon, Aug 25, 2014 at 1:13 PM, Michael Armbrust
mich...@databricks.com wrote:
 What is the plan for getting Tachyon/off-heap support for the columnar
 compressed store?  It's not in 1.1 is it?


 It is not in 1.1 and there are no concrete plans for adding it at this
 point.  Currently, there is more engineering investment going into caching
 Parquet data in Tachyon instead.  This approach will have much better
 support for nested data, leverage other work being done on Parquet, and
 alleviate your concerns about wire-format compatibility.

 That said, if someone really wants to try and implement it, I don't think it
 would be very hard.  The primary issue is going to be designing a clean
 interface that is not too tied to this one implementation.


 Also, how likely is the wire format for the columnar compressed data
 to change?  That would be a problem for write-through or persistence.


 We aren't making any guarantees at the moment that it won't change.  It's
 currently only intended for temporary caching of data.




Re: [Spark SQL] off-heap columnar store

2014-08-26 Thread Michael Armbrust

 Any initial proposal or design about the caching to Tachyon that you
 can share so far?


Caching Parquet files in Tachyon with saveAsParquetFile and then reading
them back with parquetFile should already work. You can run SQL on these
tables by using registerTempTable.
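
For example, something along these lines should work from the spark-shell
against a 1.1 build (the tachyon:// URI below is just an illustration; point
it at your own Tachyon master, and make sure the Tachyon filesystem is
configured for Hadoop, e.g. fs.tachyon.impl):

import org.apache.spark.sql.SQLContext

case class Event(id: Long, name: String)

val sqlContext = new SQLContext(sc)    // sc is the shell's SparkContext
import sqlContext.createSchemaRDD      // implicit RDD[Product] -> SchemaRDD

// Illustrative location only -- substitute your Tachyon master host/port.
val path = "tachyon://tachyon-master:19998/tables/events"

// Write a SchemaRDD out as Parquet files stored in Tachyon.
val events = sc.parallelize(Seq(Event(1L, "a"), Event(2L, "b")))
events.saveAsParquetFile(path)

// Read the Parquet data back from Tachyon and query it with SQL.
val cached = sqlContext.parquetFile(path)
cached.registerTempTable("events")
sqlContext.sql("SELECT COUNT(*) FROM events").collect()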

Some of the general Parquet work that we have been doing includes:
#1935 (https://github.com/apache/spark/pull/1935), SPARK-2721
(https://issues.apache.org/jira/browse/SPARK-2721), SPARK-3036
(https://issues.apache.org/jira/browse/SPARK-3036), SPARK-3037
(https://issues.apache.org/jira/browse/SPARK-3037), and #1819
(https://github.com/apache/spark/pull/1819).

 The reason I'm asking about the columnar compressed format is that
 there are some problems for which Parquet is not practical.


Can you elaborate?


Re: [Spark SQL] off-heap columnar store

2014-08-25 Thread Henry Saputra
Hi Michael,

This is great news.
Any initial proposal or design about the caching to Tachyon that you
can share so far?

I don't think there is a JIRA ticket open to track this feature yet.

- Henry

On Mon, Aug 25, 2014 at 1:13 PM, Michael Armbrust
mich...@databricks.com wrote:

 What is the plan for getting Tachyon/off-heap support for the columnar
 compressed store?  It's not in 1.1 is it?


 It is not in 1.1 and there are no concrete plans for adding it at this
 point.  Currently, there is more engineering investment going into caching
 Parquet data in Tachyon instead.  This approach will have much better
 support for nested data, leverage other work being done on Parquet, and
 alleviate your concerns about wire-format compatibility.

 That said, if someone really wants to try and implement it, I don't think
 it would be very hard.  The primary issue is going to be designing a clean
 interface that is not too tied to this one implementation.


 Also, how likely is the wire format for the columnar compressed data
 to change?  That would be a problem for write-through or persistence.


 We aren't making any guarantees at the moment that it won't change.  It's
 currently only intended for temporary caching of data.




[Spark SQL] off-heap columnar store

2014-08-22 Thread Evan Chan
Hey guys,

What is the plan for getting Tachyon/off-heap support for the columnar
compressed store?  It's not in 1.1 is it?

In particular:
 - being able to set TACHYON as the caching mode (see the sketch below)
 - loading of hot columns or all columns
 - write-through of columnar store data to HDFS or backing store
 - being able to start a context and query directly from Tachyon's
cached columnar data

I think most of this was in Shark 0.9.1.
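
To clarify what I mean by setting TACHYON as the caching mode: plain RDDs can
already be cached off-heap in Tachyon via StorageLevel.OFF_HEAP, and I'm
asking about an equivalent option for the compressed columnar cache behind
cacheTable.  A sketch of the contrast, assuming sc from the spark-shell and a
SQLContext created from it (the HDFS path is just an example):

import org.apache.spark.storage.StorageLevel

// This already exists for plain RDDs: Tachyon-backed, off-heap block storage.
val lines = sc.textFile("hdfs:///data/events")   // any input path
lines.persist(StorageLevel.OFF_HEAP)
lines.count()

// The columnar compressed cache, by contrast, is on-heap only today:
// sqlContext.cacheTable("events") builds the in-memory columnar representation,
// but there is no way to direct it at Tachyon.  Hypothetically, something like
//   sqlContext.cacheTable("events", StorageLevel.OFF_HEAP)
// is the kind of option I'm asking about (this overload does not exist).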

Also, how likely is the wire format for the columnar compressed data
to change?  That would be a problem for write-through or persistence.

thanks,
Evan
