That's awesome, Yan. I was considering Phoenix for SQL calls to HBase, since
Cassandra supports CQL but HBase QL support was lacking. I will get back to
you once I start using it on our loads.
I am assuming the latencies won't be much different from accessing HBase
through tsdb asynchbase as that's
You need to find the bottleneck here; it could be your network (if the data
is huge) or your producer code not pushing at 20k/s. If you are able to
produce at 20k/s, then make sure you are able to receive at that rate (try
it without Spark).
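If it helps, here is a rough sketch of measuring the raw consume rate with the plain Kafka 0.8 high-level consumer, no Spark involved (the ZK address, group id and topic name are made up):

```scala
import java.util.Properties
import kafka.consumer.{Consumer, ConsumerConfig}

// Plain Kafka 0.8 high-level consumer; prints the observed messages/second.
val props = new Properties()
props.put("zookeeper.connect", "zkhost:2181") // made-up address
props.put("group.id", "rate-check")           // made-up group id
val connector = Consumer.create(new ConsumerConfig(props))

val stream = connector.createMessageStreams(Map("mytopic" -> 1))("mytopic").head
val it = stream.iterator()
var n = 0L
val start = System.currentTimeMillis()
while (it.hasNext()) {
  it.next()
  n += 1
  if (n % 100000 == 0) {
    val secs = (System.currentTimeMillis() - start) / 1000.0
    println(f"$n msgs, ${n / secs}%.0f msg/s")
  }
}
```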
Thanks
Best Regards
On Sat, Jul 25, 2015 at 3:29 PM,
I’ve found two PRs (almost identical) for replacing mapReduceTriplets with
aggregateMessages:
https://github.com/apache/spark/pull/3782
https://github.com/apache/spark/pull/3883
The first was closed at Dave's suggestion; the second is stale.
Also there is a PR for the new Pregel API, which is also closed.
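For anyone landing here later, a minimal sketch of the migration those PRs were aiming at, assuming a Graph[Double, Double] (the message semantics here are illustrative, not taken from either PR):

```scala
import org.apache.spark.graphx._

// Assumed: a graph with Double vertex attributes and Double edge weights.
val graph: Graph[Double, Double] = ???

// Old (deprecated) style:
//   graph.mapReduceTriplets[Double](
//     t => Iterator((t.dstId, t.srcAttr * t.attr)), _ + _)

// aggregateMessages equivalent:
val messages: VertexRDD[Double] = graph.aggregateMessages[Double](
  ctx => ctx.sendToDst(ctx.srcAttr * ctx.attr), // sendMsg
  _ + _,                                        // mergeMsg
  TripletFields.Src                             // ship only srcAttr to the edges
)
```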
Can you add your description of the problem as a comment to that ticket and
we'll make sure to test both cases and break it out if the root cause ends
up being different.
On Tue, Jul 28, 2015 at 2:48 PM, Justin Uang justin.u...@gmail.com wrote:
Sweet! Does this cover DataFrame#rdd also using
Hi TD,
Thanks for the info. My scenario is like this.
I am reading data from a Kafka topic. Let's say Kafka has 3 partitions
for the topic. In my streaming application, I would configure 3 receivers
with 1 thread each, such that they would receive 3 DStreams (from the 3
partitions of Kafka
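If it's useful, a minimal sketch of the receiver setup described above, using the 0.8 receiver-based KafkaUtils.createStream API (ZK quorum, group id and topic name are assumptions):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(new SparkConf(), Seconds(2))

// Three receiver-based streams, one consumer thread each
// (ZK quorum, group id and topic name are made up).
val streams = (1 to 3).map { _ =>
  KafkaUtils.createStream(ssc, "zkhost:2181", "my-group", Map("mytopic" -> 1))
}

// Union into a single DStream before further processing.
val unified = ssc.union(streams)
```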
Hey All,
Just a friendly reminder that Aug 1st is the feature freeze for Spark
1.5, meaning major outstanding changes will need to land this week.
After Aug 1st we'll package a release for testing and then go into the
normal triage process where bugs are prioritized and some smaller
Thanks Michal,
Just to share what I'm working on in a related topic. A while ago I
built SparkOnHBase and put it into Cloudera Labs, at this link:
http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/
Also, recently I have been working on getting this into HBase core. It will
Oops, yes, I'm still messing with the repo on a daily basis... fixed
On 28 July 2015 at 17:11, Ted Yu yuzhih...@gmail.com wrote:
I got a compilation error:
[INFO] /home/hbase/s-on-hbase/src/main/scala:-1: info: compiling
[INFO] Compiling 18 source files to
Hi Ted, yes, the Cloudera blog and your code were my starting point - but I
needed something more Spark-centric rather than HBase-centric. Basically
doing a lot of ad-hoc transformations with RDDs that were based on HBase
tables, and then mutating them after a series of iterative (BSP-like) steps.
On 28 July
Brilliant! Will check it out.
Cheers
Jules
--
The Best Ideas Are Simple
Jules Damji
Developer Relations Community Outreach
jda...@hortonworks.com
http://hortonworks.com
On 7/28/15, 8:59 AM, Michal Haris
michal.ha...@visualdna.com wrote:
Hi all, last couple
Cool, will revisit. Is your latest code publicly visible somewhere?
On 28 July 2015 at 17:14, Ted Malaska ted.mala...@cloudera.com wrote:
Yup you should be able to do that with the APIs that are going into HBase.
Let me know if you need to chat about the problem and how to implement it
with
Hi,
I noticed that ReceiverTrackerSuite is failing in master Jenkins build for
both hadoop profiles.
The failure seems to start with:
https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/3104/
FYI
Hi all, for the last couple of months I've been working on a large graph
analytics project, and along the way have written an HBase-Spark integration
from scratch, as none of the ones out there worked, either in terms of scale
or in the way they integrated with the RDD interface. This week I have
generalised it into
I got a compilation error:
[INFO] /home/hbase/s-on-hbase/src/main/scala:-1: info: compiling
[INFO] Compiling 18 source files to /home/hbase/s-on-hbase/target/classes
at 1438099569598
[ERROR]
/home/hbase/s-on-hbase/src/main/scala/org/apache/spark/hbase/examples/simple/HBaseTableSimple.scala:36:
Hi all,
I'm using Spark SQL in Spark 1.4.1. I encounter an error when using a
parquet table after recreating it; the error can be reproduced as follows:
```scala
// hc is an instance of HiveContext
hc.sql("select * from b").show() // this is ok, and b is a parquet table
val df =
```
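The snippet is cut off above; purely as a guess at the shape of the repro (the recreate step below is an assumption, not the original code):

```scala
// Hypothetical continuation: drop and recreate the table, then query again.
hc.sql("drop table b")
hc.sql("create table b stored as parquet as select 1 as id")
hc.sql("select * from b").show() // the reported error presumably appears here
```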
// ping
Do we have any sign-off from the PySpark devs to submit a PR to publish to
PyPI?
On Fri, Jul 24, 2015 at 10:50 PM Jeremy Freeman freeman.jer...@gmail.com
wrote:
Hey all, great discussion, just wanted to +1 that I see a lot of value in
steps that make it easier to use PySpark as an
Hi Imran,
Thanks for your reply. I have double-checked the code I ran to
generate an n x n matrix and n x 1 vector for n = 2^27. There was
unfortunately a bug in it: instead of typing 134,217,728
for n = 2^27, I included a third '7' by mistake, making the size 10x
larger.
However, even
Yup you should be able to do that with the APIs that are going into HBase.
Let me know if you need to chat about the problem and how to implement it
with the HBase APIs.
We have tried to cover every possible way to use HBase with Spark. Let us
know if we missed anything; if we did, we will add it.
Hello Devs,
I am investigating how matrix-vector multiplication can scale for an
IndexedRowMatrix in mllib.linalg.distributed.
Currently, I am broadcasting the vector to be multiplied on the right.
The IndexedRowMatrix is stored across a cluster with up to 16 nodes,
each with 200 GB of memory.
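A minimal sketch of that broadcast scheme, assuming dense rows (the matrix and vector below are placeholders):

```scala
import org.apache.spark.mllib.linalg.DenseVector
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

// Placeholders for the actual data.
val mat: IndexedRowMatrix = ??? // n x n, rows partitioned across the cluster
val v: DenseVector = ???        // the n x 1 vector, broadcast to every node
val bv = mat.rows.context.broadcast(v)

// Each distributed row contributes one entry of the result: (i, row . v)
val result = mat.rows.map { case IndexedRow(i, row) =>
  val r = row.toArray
  val x = bv.value.values
  var dot = 0.0
  var j = 0
  while (j < r.length) { dot += r(j) * x(j); j += 1 }
  (i, dot)
}
```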
Stuff that people are using is here.
https://github.com/cloudera-labs/SparkOnHBase
The stuff going into HBase is here
https://issues.apache.org/jira/browse/HBASE-13992
If you want to add things to the HBase ticket, let's do it in another JIRA.
Like these JIRAs
Sorry, this is more correct (see the sketch after the list):
RDD and DStream Functions
1. BulkPut
2. BulkGet
3. BulkDelete
4. Foreach with connection
5. Map with connection
6. Distributed Scan
7. BulkLoad
DataFrame Functions
1. BulkPut
2. BulkGet
3. Foreach with connection
4. Map with connection
5. Distributed Scan
6. BulkLoad
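For instance, here is a hedged sketch of what a BulkPut looks like against the cloudera-labs SparkOnHBase API linked above (the table, column family and values are made up, and the signatures going into HBase core may differ):

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.util.Bytes
import com.cloudera.spark.hbase.HBaseContext

// `sc` is an existing SparkContext.
val hbaseContext = new HBaseContext(sc, HBaseConfiguration.create())

// Some (rowKey, value) pairs to write; table and column names are made up.
val rdd = sc.parallelize(Seq(("rowKey1", "v1"), ("rowKey2", "v2")))

hbaseContext.bulkPut[(String, String)](
  rdd,
  "myTable",
  record => {
    val put = new Put(Bytes.toBytes(record._1))
    put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(record._2))
    put
  },
  false) // autoFlush
```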
I think we do support 0-arg UDFs:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L2165
How are you using UDFs?
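For reference, a quick sketch of both ways to call a 0-arg UDF (the names are made up; `df` and `sqlContext` are assumed to be in scope):

```scala
import org.apache.spark.sql.functions.udf

// Zero-arg UDF via the functions.udf overload linked above.
val epochMs = udf(() => System.currentTimeMillis())
df.select(epochMs().as("ts"))

// Or registered for use in SQL text (the name is made up):
sqlContext.udf.register("epoch_ms", () => System.currentTimeMillis())
sqlContext.sql("SELECT epoch_ms()")
```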
On Tue, Jul 28, 2015 at 2:15 AM, Sachith Withana swsach...@gmail.com
wrote:
Hi all,
Currently I need to support custom
Thanks Ted for pointing this out. CC to Ryan and TD
On Tue, Jul 28, 2015 at 8:25 AM, Ted Yu yuzhih...@gmail.com wrote:
Hi,
I noticed that ReceiverTrackerSuite is failing in master Jenkins build for
both hadoop profiles.
The failure seems to start with:
On 27 Jul 2015, at 16:42, Ulanov, Alexander alexander.ula...@hp.com wrote:
It seems that the two joins mentioned can be rewritten as one outer join
You're right. In fact, the outer join can be streamlined further using a
method from GraphOps:
g = g.joinVertices(messages)(vprog).cache()
Then,
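For context, here is that one-liner spelled out with concrete (illustrative) types; `messages` and `vprog` come from the surrounding Pregel-style loop:

```scala
import org.apache.spark.graphx._

// Concrete types are illustrative; `messages` and `vprog` come from the
// surrounding Pregel-style loop.
var g: Graph[Double, Double] = ???
val messages: VertexRDD[Double] = ???
val vprog: (VertexId, Double, Double) => Double = ???

// joinVertices updates only vertices that received a message; all others
// keep their old attribute, which is exactly what the outer join achieved.
g = g.joinVertices(messages)(vprog).cache()
```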
hey all, i'm just back in from my wedding weekend (woot!) and am
working on figuring out what's happening w/the git timeouts for pull
request builds.
TL;DR: if your build fails due to a timeout, please retrigger your
builds. i know this isn't the BEST solution, but until we get some
stuff
btw, the directory perm issue was only happening on
amp-jenkins-worker-04 and -05. both of the broken dirs were
clobbered, so we won't be seeing any more of these again.
On Tue, Jul 28, 2015 at 12:28 PM, shane knapp skn...@berkeley.edu wrote:
++joshrosen
ok, i found out some of what's going
Hi,
Out of curiosity, I have tried to replace the dependency on bash with sh
in the various scripts that launch Spark daemons and jobs. So far,
most scripts work with sh, except bin/spark-class. The culprit is
the while loop that composes the final command by parsing the output of
the launcher library.
++joshrosen
ok, i found out some of what's going on. some builds were failing as such:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/38749/console
note that it's unable to remove the target/ directory during the
build... this is caused by 'git clean -fdx' running, and deep
On Tue, Jul 28, 2015 at 12:13 PM, Félix-Antoine Fortin
felix-antoine.for...@calculquebec.ca wrote:
The while loop cannot be executed with sh, while the single line can
be. Since on my system sh is simply a link to bash with some options
activated, I guess this simply means that the while
Thanks Sean. Very helpful!
On Tue, Jul 28, 2015 at 1:49 PM, Sean Owen so...@cloudera.com wrote:
You only need to rebase if your branch/PR now conflicts with master.
You don't need to squash, since the merge script will do that in the
end for you. You can squash commits and force-push if you
Hi Mike,
are you sure the size isn't off by 2x somehow? I just tried to
reproduce with a simple test in BlockManagerSuite:
test("large block") {
  store = makeBlockManager(4e9.toLong)
  val arr = new Array[Double](1 << 28)
  println(arr.size)
  val blockId = BlockId("rdd_3_10")
  val result =
Thanks for bringing this up! I talked with Michael Armbrust, and it sounds
like this is from a bug in DataFrame caching:
https://issues.apache.org/jira/browse/SPARK-9141
It's marked as a blocker for 1.5.
Joseph
On Tue, Jul 28, 2015 at 2:36 AM, Justin Uang justin.u...@gmail.com wrote:
Hey
git caches are set up on all workers for the pull request builder, and
builds are building w/the cache... however in the build logs it
doesn't seem to be actually *hitting* the cache, so i guess i'll be
doing some more poking and prodding to see wtf is going on.
On Tue, Jul 28, 2015 at 12:49
I am planning to update my PR to incorporate comments from reviewers.
Do I need to rebase/squash the commits into a single one?
Thanks!
-MW
You only need to rebase if your branch/PR now conflicts with master.
You don't need to squash, since the merge script will do that in the
end for you. You can squash commits and force-push if you think it
would help clean up your intent, but often it's clearer to leave the
review and commit
Sweet! Does this cover DataFrame#rdd also using the cached query from
DataFrame#cache? I think ticket 9141 is mainly concerned with whether a
DataFrame (B) derived from a cached DataFrame (A) uses the cached query of A,
not whether the RDD from A.rdd or B.rdd uses the cached query of A.
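In other words, with a hypothetical example:

```scala
// Hypothetical illustration of the distinction (table/column names made up):
val a = sqlContext.table("events").cache() // cached DataFrame
val b = a.filter(a("value") > 0)           // DataFrame derived from A

b.show() // does B's query reuse A's cache? (the main SPARK-9141 concern)
a.rdd    // does the extracted RDD still go through A's cached plan?
```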
On Tue,