My knowledge of XSEDE is limited - I visited the website.

If there is no easy way to deploy HBase, an alternative approach (using
HDFS?) needs to be considered.

I need to do more homework on this :-)
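
In the meantime, one pure-Spark option that might be worth trying is to
co-partition both RDDs with the same partitioner and pair them up with
zipPartitions, so that each inverted index is shuffled only to the partition
holding the vectors with the same key, instead of being broadcast everywhere.
Below is a rough sketch under assumptions: Vec, InvIndex, compare and
matchAgainstIndexes are hypothetical stand-ins for your SparseVector, your
InvIndex and whatever comparison you actually run; only partitionBy,
HashPartitioner and zipPartitions are real Spark API.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object CoPartitionSketch {
  // Hypothetical stand-ins for the SparseVector and InvIndex types in the thread.
  case class Vec(indices: Array[Int], values: Array[Double])
  case class InvIndex(postings: Map[Int, Seq[Int]])

  // Hypothetical comparison: counts the vector's feature ids present in the index.
  def compare(v: Vec, idx: InvIndex): Int =
    v.indices.count(idx.postings.contains)

  def matchAgainstIndexes(
      vectors: RDD[(Int, Vec)],
      invertedIndexes: RDD[(Int, InvIndex)],
      numPartitions: Int): RDD[(Int, Int)] = {
    // The same partitioner on both sides sends equal keys to the same partition index.
    val partitioner = new HashPartitioner(numPartitions)
    val vecsByKey = vectors.partitionBy(partitioner)
    val idxByKey = invertedIndexes.partitionBy(partitioner)

    vecsByKey.zipPartitions(idxByKey, preservesPartitioning = true) {
      (vecIter, idxIter) =>
        // Only the indexes whose keys hash to this partition are materialized here.
        val localIndexes = idxIter.toMap
        vecIter.flatMap { case (key, v) =>
          localIndexes.get(key).map(idx => (key, compare(v, idx)))
        }
    }
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("co-partition-sketch")
    val sc = new SparkContext(conf)
    try {
      val vectors = sc.parallelize(Seq(
        (0, Vec(Array(1, 2), Array(1.0, 1.0))),
        (1, Vec(Array(3), Array(1.0)))))
      val indexes = sc.parallelize(Seq(
        (0, InvIndex(Map(1 -> Seq(10), 2 -> Seq(11)))),
        (1, InvIndex(Map(4 -> Seq(12))))))
      matchAgainstIndexes(vectors, indexes, numPartitions = 2)
        .collect().sorted.foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

If several keys land in the same partition, the per-partition map still holds
only those keys' indexes, so the cost is one shuffle of each index rather than
a copy of every index on every executor. Whether that beats an external store
depends on how large the indexes are relative to executor memory.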

On Thu, Jan 14, 2016 at 3:51 PM, Daniel Imberman wrote:

> Hi Ted,
> So unfortunately after looking into the cluster manager that I will be
> using for my testing (I'm using a super-computer called XSEDE rather than
> AWS), it looks like the cluster does not actually come with HBase installed
> (this cluster is becoming somewhat problematic, as it is essentially AWS
> but you have to do your own virtualization scripts). Do you have any other
> thoughts on how I could go about dealing with this purely using Spark and
> Thank you
> On Wed, Jan 13, 2016 at 11:49 AM Daniel Imberman wrote:
>> Thank you Ted! That sounds like it would probably be the most efficient
>> (with the least overhead) way of handling this situation.
>> On Wed, Jan 13, 2016 at 11:36 AM Ted Yu wrote:
>>> Another approach is to store the objects in a NoSQL store such as HBase.
>>> Looking up an object should be very fast.
>>> Cheers
>>> On Wed, Jan 13, 2016 at 11:29 AM, Daniel Imberman wrote:
>>>> I'm looking for a way to send structures to pre-determined partitions so
>>>> that they can be used by another RDD in a mapPartitions call.
>>>> Essentially I'm given an RDD of SparseVectors and an RDD of inverted
>>>> indexes. The inverted index objects are quite large.
>>>> My hope is to do a mapPartitions within the RDD of vectors where I can
>>>> compare each vector to the inverted index. The issue is that I only NEED
>>>> one inverted index object per partition (which would have the same key as
>>>> the values within that partition).
>>>> val vectors: RDD[(Int, SparseVector)]
>>>> val invertedIndexes: RDD[(Int, InvIndex)] =
>>>>   a.reduceByKey(generateInvertedIndex)
>>>> vectors.mapPartitions { iter =>
>>>>   // pseudocode: fetch the one index whose key matches this partition
>>>>   val invIndex = invertedIndexes(samePartitionKey)
>>>>   ...
>>>> }
>>>> How could I go about setting up the partitions such that the specific data
>>>> structure I need will be present for the mapPartitions call, but without
>>>> the extra overhead of sending over all values (which would happen if I
>>>> were to make a broadcast variable)?
>>>> One thought I have been having is to store the objects in HDFS, but I'm
>>>> not sure if that would be a suboptimal solution (it seems like it could
>>>> slow down the process a lot).
>>>> Another thought I am currently exploring is whether there is some way I
>>>> can create a custom Partition or Partitioner that could hold the data
>>>> structure (although that might get too complicated and become
>>>> problematic).
>>>> Any thoughts on how I could attack this issue would be highly appreciated.
>>>> Thank you for your help!
