My knowledge of XSEDE is limited; I have only visited the website.

If there is no easy way to deploy HBase, an alternative approach (using HDFS?)
needs to be considered.
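A rough sketch of the HDFS route, assuming the inverted indexes have been serialized out ahead of time as one file per key, with the file layout lined up with the partitioning. Note that `deserialize`, the `/indexes/part-N` path layout, and the `InvIndex` API are hypothetical placeholders here, not a confirmed design:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Each partition opens only the one index file it needs, so the large
// index objects are never broadcast to every node.
val similarities = vectors.mapPartitionsWithIndex { (partId, iter) =>
  val fs = FileSystem.get(new Configuration())
  // Assumes indexes were pre-written so that partition id matches
  // the index key (hypothetical layout).
  val in = fs.open(new Path(s"/indexes/part-$partId"))
  val invIndex = deserialize[InvIndex](in) // hypothetical helper
  in.close()
  iter.map { case (key, vec) => (key, invIndex.calculateSimilarity(vec)) }
}
```

The per-partition read happens once per task rather than once per record, so the HDFS overhead may be acceptable if the index files are not too large.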

I need to do more homework on this :-)
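For what it's worth, one Spark-only alternative (no HBase, no broadcast of the full index set) is to co-partition both RDDs with the same partitioner and then walk matching partitions together with `zipPartitions`. A minimal sketch, assuming one index per key and reusing the `InvIndex` / `calculateSimilarity` names from the original question:

```scala
import org.apache.spark.HashPartitioner

// Same partitioner on both RDDs puts matching keys in matching partitions.
val p = new HashPartitioner(vectors.partitions.length)
val vecsByKey  = vectors.partitionBy(p)
val indexByKey = invertedIndexes.partitionBy(p)

// Each partition of vecsByKey sees only the few indexes whose keys hash
// to that partition, instead of every index on every node.
val sims = vecsByKey.zipPartitions(indexByKey) { (vecIter, idxIter) =>
  val indexes = idxIter.toMap // small: only this partition's indexes
  vecIter.map { case (key, vec) =>
    (key, indexes(key).calculateSimilarity(vec))
  }
}
```

`zipPartitions` requires both RDDs to have the same number of partitions, which holds here because both were repartitioned with `p`.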

On Thu, Jan 14, 2016 at 3:51 PM, Daniel Imberman <daniel.imber...@gmail.com>
wrote:

> Hi Ted,
>
> So unfortunately, after looking into the cluster manager that I will be
> using for my testing (I'm using a super-computer called XSEDE rather than
> AWS), it turns out the cluster does not actually come with HBase installed
> (this cluster is becoming somewhat problematic, as it is essentially AWS
> but you have to write your own virtualization scripts). Do you have any
> other thoughts on how I could deal with this using only Spark and HDFS?
>
> Thank you
>
> On Wed, Jan 13, 2016 at 11:49 AM Daniel Imberman <
> daniel.imber...@gmail.com> wrote:
>
>> Thank you Ted! That sounds like it would probably be the most efficient
>> (with the least overhead) way of handling this situation.
>>
>> On Wed, Jan 13, 2016 at 11:36 AM Ted Yu <yuzhih...@gmail.com> wrote:
>>
>>> Another approach is to store the objects in NoSQL store such as HBase.
>>>
>>> Looking up objects should be very fast.
>>>
>>> Cheers
>>>
>>> On Wed, Jan 13, 2016 at 11:29 AM, Daniel Imberman <
>>> daniel.imber...@gmail.com> wrote:
>>>
>>>> I'm looking for a way to send structures to pre-determined partitions
>>>> so that
>>>> they can be used by another RDD in a mapPartition.
>>>>
>>>> Essentially, I'm given an RDD of SparseVectors and an RDD of inverted
>>>> indexes. The inverted index objects are quite large.
>>>>
>>>> My hope is to do a MapPartitions within the RDD of vectors where I can
>>>> compare each vector to the inverted index. The issue is that I only
>>>> NEED one
>>>> inverted index object per partition (which would have the same key as
>>>> the
>>>> values within that partition).
>>>>
>>>>
>>>> val vectors: RDD[(Int, SparseVector)]
>>>>
>>>> val invertedIndexes: RDD[(Int, InvIndex)] =
>>>>   a.reduceByKey(generateInvertedIndex)
>>>>
>>>> vectors.mapPartitions { iter =>
>>>>   // Ideally, look up the single inverted index matching this
>>>>   // partition's key (this lookup is the open question):
>>>>   val invIndex = invertedIndexes(samePartitionKey)
>>>>   iter.map(invIndex.calculateSimilarity(_))
>>>> }
>>>>
>>>> How could I go about setting up the partitions such that the specific
>>>> data structure I need will be present for the mapPartitions call, but
>>>> without the extra overhead of sending all values to every node (which
>>>> would happen if I were to use a broadcast variable)?
>>>>
>>>> One thought I have been having is to store the objects in HDFS, but I'm
>>>> not sure if that would be a suboptimal solution (it seems like it could
>>>> slow down the process a lot).
>>>>
>>>> Another thought I am currently exploring is whether there is some way I
>>>> can create a custom Partition or Partitioner that could hold the data
>>>> structure (although that might get too complicated and become
>>>> problematic).
>>>>
>>>> Any thoughts on how I could attack this issue would be highly
>>>> appreciated.
>>>>
>>>> thank you for your help!
>>>>
>>>>
>>>>
>>>> --
>>>> View this message in context:
>>>> http://apache-spark-user-list.1001560.n3.nabble.com/Sending-large-objects-to-specific-RDDs-tp25967.html
>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>>
>>>
