[
https://issues.apache.org/jira/browse/SPARK-16614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hyukjin Kwon resolved SPARK-16614.
----------------------------------
Resolution: Incomplete
> DirectJoin with DataSource for SparkSQL
> ---------------------------------------
>
> Key: SPARK-16614
> URL: https://issues.apache.org/jira/browse/SPARK-16614
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Affects Versions: 2.0.0
> Reporter: Russell Spitzer
> Priority: Major
> Labels: bulk-closed
>
> Join behaviors against some datasources can be improved by skipping a full
> scan and instead performing a series of point lookups.
> An example:
> {code}
> DataFrame A contains {key1, key5, key302, ..., key50923423}
> DataFrame B is a source reading from a C* database with keys {key1, key2, key3, ...}
> a.join(b)
> {code}
> Currently this causes the entirety of DataFrame B to be read into memory
> before the join is performed. Instead, it would be useful to expose another
> API, {{DirectJoinSource}}, that allows connectors to provide a means of
> requesting a non-contiguous subset of keys from a DataSource.
> This kind of lookup would behave like the joinWithCassandraTable call in the
> Spark Cassandra Connector:
> https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md#using-joinwithcassandratable
>
> We find that this is most useful when the end user requests only a small
> set of well-defined records. I believe this could apply to a variety of
> datasources where reading the entire source is inefficient compared to
> point lookups.
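
The difference between the two join strategies described in the issue can be sketched outside Spark. The following is a minimal illustration in plain Python, not a Spark or connector API: {{DirectJoinSource}} is only the name proposed in the issue, and its `lookup` method, the `InMemoryKeyValueSource` stand-in for a key-value store, and the `scan_join` / `direct_join` helpers are all hypothetical names chosen for this example. The `rows_read` counter exists only to show how many source rows each strategy touches.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

Row = Dict[str, object]


class DirectJoinSource:
    """Hypothetical interface: fetch rows for specific keys without a full scan."""

    def lookup(self, keys: Iterable[str]) -> Iterator[Tuple[str, Row]]:
        raise NotImplementedError


class InMemoryKeyValueSource(DirectJoinSource):
    """Stand-in for a keyed store; counts how many rows are read from it."""

    def __init__(self, table: Dict[str, Row]) -> None:
        self._table = table
        self.rows_read = 0

    def scan(self) -> Iterator[Tuple[str, Row]]:
        # Full scan: every row in the source is read.
        for key, row in self._table.items():
            self.rows_read += 1
            yield key, row

    def lookup(self, keys: Iterable[str]) -> Iterator[Tuple[str, Row]]:
        # Point lookups: only rows whose keys are requested and present are read.
        for key in keys:
            row = self._table.get(key)
            if row is not None:
                self.rows_read += 1
                yield key, row


def scan_join(left: Iterable[Tuple[str, Row]],
              source: InMemoryKeyValueSource) -> List[Tuple[str, Row, Row]]:
    """Inner join that first materializes the whole source, then probes it."""
    right = dict(source.scan())
    return [(key, lrow, right[key]) for key, lrow in left if key in right]


def direct_join(left: Iterable[Tuple[str, Row]],
                source: DirectJoinSource) -> List[Tuple[str, Row, Row]]:
    """Inner join that asks the source only for the keys present on the left."""
    left_by_key: Dict[str, List[Row]] = {}
    for key, lrow in left:
        left_by_key.setdefault(key, []).append(lrow)
    result = []
    for key, rrow in source.lookup(left_by_key.keys()):
        for lrow in left_by_key[key]:
            result.append((key, lrow, rrow))
    return result


def make_source(n: int = 1000) -> InMemoryKeyValueSource:
    """Build a source holding key1 .. key<n> (sample data for the sketch)."""
    return InMemoryKeyValueSource({f"key{i}": {"value": i} for i in range(1, n + 1)})


# A small left side, one key of which (key5000) is absent from the source.
LEFT = [("key1", {"a": 1}), ("key5", {"a": 5}),
        ("key302", {"a": 302}), ("key5000", {"a": 5000})]

if __name__ == "__main__":
    for join in (scan_join, direct_join):
        src = make_source()
        rows = join(LEFT, src)
        print(join.__name__, "matched", len(rows), "rows; source rows read:", src.rows_read)
```

Both helpers produce the same inner-join result; they differ only in how many rows they pull from the source. A real connector would presumably batch or parallelize the lookups per partition, which this sketch does not attempt.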