On Sun, Aug 31, 2014 at 8:27 PM, Ian O'Connell <i...@ianoconnell.com> wrote:
> I'm not sure what you mean here? Parquet is, at its core, just a format;
> you could store that data anywhere.
>
> Though it sounds like you're saying, correct me if I'm wrong: you basically
> want a columnar abstraction layer where you can provide a different backing
> implementation to keep the columns, rather than parquet-mr?
>
> I.e. you want to be able to produce a schema RDD from something like
> Vertica, where updates should act as a write-through cache back to Vertica
> itself?

Something like that.

I'd like:

1)  An API to produce a schema RDD from an RDD of columns, not rows.
  However, an RDD[Column] would not make sense, since a single column's
data would be spread out across partitions.  Perhaps what is needed is
a Seq[RDD[ColumnSegment]]: each RDD would hold the segments for one
column, and each segment would represent a range of rows.  This would
then read from something like Vertica or Cassandra.  (A rough sketch of
what such an API might look like follows item 2 below.)

2)  A variant of 1) where you could read this data from Tachyon.
Tachyon is supposed to support a columnar representation of data; it
did for Shark 0.9.x.
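
For concreteness, here is a rough sketch of what 1) might look like.
All of these names (ColumnSegment, ColumnarSource, toSchemaRDD) are
hypothetical; nothing like this exists in Spark SQL today, it is only
meant to illustrate the shape of the API:

  import org.apache.spark.rdd.RDD
  import org.apache.spark.sql.{SQLContext, SchemaRDD}

  // Hypothetical: one compressed chunk of a single column, covering a
  // contiguous range of rows.
  case class ColumnSegment(
      columnName: String,
      startRow: Long,              // first row covered by this segment
      numRows: Int,                // number of rows in the segment
      compressedBytes: Array[Byte] // columnar-compressed values
  )

  // Hypothetical source: one RDD of segments per column (in schema order),
  // assembled into a SchemaRDD only when Spark SQL needs row-oriented access.
  trait ColumnarSource {
    def columns: Seq[RDD[ColumnSegment]]
    def toSchemaRDD(sqlContext: SQLContext): SchemaRDD
  }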

The goal is basically to load columnar data from something like
Cassandra into Tachyon, keeping the compression ratio of columnar
storage and the speed of InMemoryColumnarTableScan.  If data is
appended to the Tachyon representation, it should be possible to write
it back, though write-back is not as high a priority.

A workaround would be to read the data from Cassandra/Vertica/etc. and
write it back out as Parquet, but this would take a long time and incur
huge I/O overhead.
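
For reference, that workaround with the current (Spark 1.1-era) API would
look roughly like the sketch below.  The Reading case class and the
parallelize call just stand in for whatever a Cassandra/Vertica connector
actually returns, and the HDFS path is arbitrary:

  import org.apache.spark.SparkContext
  import org.apache.spark.sql.SQLContext

  object ParquetRoundTrip {
    // example schema; stands in for whatever the source table holds
    case class Reading(sensor: String, ts: Long, value: Double)

    def run(sc: SparkContext): Unit = {
      val sqlContext = new SQLContext(sc)
      import sqlContext.createSchemaRDD // implicit RDD[Product] -> SchemaRDD

      // stand-in for a Cassandra/Vertica connector read
      val rowsFromSource = sc.parallelize(Seq(Reading("a", 0L, 1.0)))

      // first full pass over the data: write it all out as Parquet on HDFS
      rowsFromSource.saveAsParquetFile("hdfs:///tmp/readings.parquet")

      // second pass: read the Parquet files back before any query can run
      val parquetRows = sqlContext.parquetFile("hdfs:///tmp/readings.parquet")
      parquetRows.registerTempTable("readings")
    }
  }

This is the double write/read pass that loading segments directly into
the in-memory columnar cache would avoid.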

>
> I'm sorry, it just sounds like it's worth clearly defining what your key
> requirement/goal is.
>
>
> On Thu, Aug 28, 2014 at 11:31 PM, Evan Chan <velvia.git...@gmail.com> wrote:
>>
>> >
>> >> The reason I'm asking about the columnar compressed format is that
>> >> there are some problems for which Parquet is not practical.
>> >
>> >
>> > Can you elaborate?
>>
>> Sure.
>>
>> - Organization or company has no Hadoop, but a significant investment
>> in some other NoSQL store.
>> - Need to efficiently add a new column to existing data.
>> - Need to mark some existing rows as deleted, or replace small bits of
>> existing data.
>>
>> For these use cases, it would be much more efficient and practical if
>> we didn't have to pull the original data out of the datastore and
>> convert it to Parquet first.  Doing so adds significant latency and
>> causes Ops headaches in having to maintain HDFS.  It would be great
>> to be able to load data directly into the columnar format, into the
>> InMemoryColumnarCache.
>>
>
