In ALS, I guess each iteration's RDDs are referenced by the next iteration's RDD, so none of the shuffle data will be deleted until the ALS job finishes…
I guess checkpointing could solve my problem. Do you know about checkpointing?
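What I have in mind is roughly the following (the checkpoint directory, step() and the interval below are just placeholders for illustration, not the real ALS internals):

sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")   // illustrative path
var current = initialRdd                               // placeholder starting RDD
for (i <- 1 to numIterations) {
  current = step(current).persist()                    // step() stands in for one iteration's transformations
  if (i % 5 == 0) {
    current.checkpoint()                               // truncates the lineage, so old shuffle files can go
    current.count()                                    // force materialization so the checkpoint actually runs
  }
}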
On March 3, 2015, at 4:18 PM, nitin [via Apache Spark User List]
Can you provide the detailed failure call stack?
From: shahab [mailto:shahab.mok...@gmail.com]
Sent: Tuesday, March 3, 2015 3:52 PM
To: user@spark.apache.org
Subject: Supporting Hive features in Spark SQL Thrift JDBC server
Hi,
According to Spark SQL documentation, Spark SQL supports the
Hi Yi,
Thanks for your reply.
1. The version of spark is 1.2.0 and the version of hive is 0.10.0-cdh4.2.1.
2. The full trace stack of the exception:
15/03/03 13:41:30 INFO Client:
client token:
DUrrav1rAADCnhQzX_Ic6CMnfqcW2NIxra5n8824CRFZQVJOX0NMSUVOVF9UT0tFTgA
diagnostics: User
Shuffle write data will be cleaned up once it is no longer referenced by any object, directly or indirectly. There is a garbage collector inside Spark which periodically checks for weak references to RDDs, shuffle writes, and broadcasts, and deletes them.
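In an iterative job that usually means dropping references to the old RDDs yourself; a rough sketch (input, parse, update and numIterations below are placeholders):

var current = input.map(parse).reduceByKey(_ + _).cache()   // input/parse are placeholders
for (i <- 1 to numIterations) {
  val next = update(current).cache()   // update() stands in for one iteration
  next.count()                         // materialize before releasing the previous RDD
  current.unpersist()                  // no live reference left -> its shuffle data becomes collectable
  current = next
}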
I'm trying to execute a query with Spark.
(Example from the Spark Documentation)
val teenagers = people.where('age >= 10).where('age <= 19).select('name)
Is it possible to execute an OR with this syntax?
val teenagers = people.where('age >= 10 'or 'age <= 4).where('age <= 19).select('name)
I have
Hi,
Could you please let me know how to do this? (or) Any suggestion
Regards,
Rajesh
On Mon, Mar 2, 2015 at 4:47 PM, Madabhattula Rajesh Kumar
mrajaf...@gmail.com wrote:
Hi,
I have the below edge list. How do I find the parent path for every vertex?
Example :
Vertex 1 path : 2, 3, 4, 5, 6
Use where('age >= 10 || 'age <= 4) instead.
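The comparison operators were mangled by the archive, so to spell it out: a rough sketch of how the OR can be written, assuming the people/sqlContext setup from the docs example and the Spark 1.2 SchemaRDD DSL (the SQL-string form should work too):

// DSL form, with the comparison operators written out explicitly:
val teenagers = people.where('age >= 10 || 'age <= 4).select('name)

// Equivalent with a SQL string, if the DSL operators give you trouble:
people.registerTempTable("people")
val teenagers2 = sqlContext.sql("SELECT name FROM people WHERE age >= 10 OR age <= 4")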
-Original Message-
From: Guillermo Ortiz [mailto:konstt2...@gmail.com]
Sent: Tuesday, March 3, 2015 5:14 PM
To: user
Subject: SparkSQL, executing an OR
I'm trying to execute a query with Spark.
(Example from the Spark Documentation)
val teenagers
Hive UDFs are only applicable to HiveContext and its subclass instances; is the CassandraAwareSQLContext a direct subclass of HiveContext or of SQLContext?
From: shahab [mailto:shahab.mok...@gmail.com]
Sent: Tuesday, March 3, 2015 5:10 PM
To: Cheng, Hao
Cc: user@spark.apache.org
Subject: Re:
Here is my current implementation with the current master version of Spark:
class DeepCNNFeature extends Transformer with HasInputCol with HasOutputCol ... {
  override def transformSchema(...) { ... }
  override def transform(dataSet: DataFrame, paramMap: ParamMap): DataFrame = {
  ...
I use LATERAL VIEW explode(...) to read data from a parquet file, but the full schema is requested by parquet instead of just the used columns. When I don't use LATERAL VIEW, the requested schema has just the two columns which I use. Is this correct, or is there room for an optimization, or do I
Using the SchemaRDD / DataFrame API via HiveContext.
Assuming you're using the latest code, something probably like:
val hc = new HiveContext(sc)
import hc.implicits._
existedRdd.toDF().insertInto("hivetable")
or
existedRdd.toDF().registerTempTable("mydata")
hc.sql("insert into hivetable as select xxx
I am also starting to work on this one. Did you get any solution to this
issue?
Hi Robin,
Thank you for your response. Please find below my question. I have a below
edge file
Source Vertex -> Destination Vertex
1 -> 2
2 -> 3
3 -> 4
4 -> 5
5 -> 6
6 -> 6
In this graph the 1st vertex is connected to the 2nd vertex, the 2nd vertex is connected to the 3rd vertex, and so on; the 6th vertex is connected to the 6th vertex (itself).
Thank you Arush, I've implemented initial data for a windowed operation and
opened a pull request here:
https://github.com/apache/spark/pull/4875
On Tue, Feb 24, 2015 at 4:49 AM, Arush Kharbanda ar...@sigmoidanalytics.com
wrote:
I think this could be of some help to you.
Hi,
How can I insert into an existing Hive table using an RDD containing my data?
Any examples?
Best,
Patcharee
You need to increase the parallelism / repartition the data to a higher number to get rid of those.
Thanks
Best Regards
On Tue, Mar 3, 2015 at 2:26 PM, lisendong lisend...@163.com wrote:
Why is the GC time so long?
I'm using ALS in MLlib, and the garbage collection time is too long.
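To make the repartitioning advice above concrete, a rough sketch (the numbers are illustrative and rawRatings stands in for the input RDD of Ratings):

import org.apache.spark.mllib.recommendation.ALS

val ratings = rawRatings.repartition(200)   // more, smaller partitions -> less data per task, less GC pressure
val model = new ALS()
  .setRank(10)
  .setIterations(10)
  .setBlocks(200)                           // ALS's own partitioning knob
  .run(ratings)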
Hi, is there a paper or a document where one can read how Spark reads Cassandra data in parallel? And how it writes data back from RDDs? It's a bit hard to have a clear picture in mind.
Thank you,
Pavel Velikhov
On Mar 3, 2015, at 1:08 AM, Rumph, Frens Jan m...@frensjan.nl wrote:
Hi all,
Hi,
Operations are not very extensive, as this scenario is not always
reproducible.
One of the executors starts behaving in this manner. For this particular application, we are using 8 cores in one executor, and practically, 4 executors are launched on one machine.
This machine has a good config
Hi,
Is there any relation between removing the block manager of an executor and marking that executor as lost?
In my setup, even after removing the block manager (after failing to do some operation), it is taking more than 20 minutes to mark it as a lost executor.
Following are the logs:
*15/03/03 10:26:49
Regarding current_date, I think it is not in either Hive 0.12.0 or 0.13.1
(the versions that we support). It seems https://issues.apache.org/jira/browse/HIVE-5472 added it to Hive recently.
On Tue, Mar 3, 2015 at 6:03 AM, Cheng, Hao hao.ch...@intel.com wrote:
The temp table in metastore can not be
Hi Sam:
Shouldn't you define the table schema? I had the same problem in Scala and then I solved it by defining the schema. I did this:
sqlContext.applySchema(dataRDD, tableSchema).registerTempTable(tableName)
Hope it helps.
On Mon, Jan 5, 2015 at 7:01 PM, Sam Flint sam.fl...@magnetic.com wrote:
And this SO post goes into details on the PRNG in Java
http://stackoverflow.com/questions/9907303/does-java-util-random-implementation-differ-between-jres-or-platforms
On 3 Mar 2015, at 16:15, Robin East robin.e...@xense.co.uk wrote:
This is more of a java/scala question than spark - it uses
I am trying to run the Java class below on a YARN cluster, but it hangs in the ACCEPTED state. I don't see any error. Below are the class and the command.
Any help is appreciated.
Thanks,
Abhi
bin/spark-submit --class com.mycompany.app.SimpleApp --master yarn-cluster
/home/hduser/my-app-1.0.jar
Hi,
What pseudo-random-number generator does scala.util.Random use?
Hi,
I have tried the program below using the Pregel API, but I'm not able to get my required output. I'm getting exactly the reverse of the output I'm expecting.
// Creating the graph using the edge file mentioned in the mail above
val graph: Graph[Int, Int] = GraphLoader.edgeListFile(sc,
Have you tried EdgeDirection.In?
On 3 Mar 2015, at 16:32, Robin East robin.e...@xense.co.uk wrote:
What about the following which can be run in spark shell:
import org.apache.spark._
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD
val vertexlist = Array((1L, "One"),
Hello Shahab,
I think CassandraAwareHiveContext
https://github.com/tuplejump/calliope/blob/develop/sql/hive/src/main/scala/org/apache/spark/sql/hive/CassandraAwareHiveContext.scala
in
Calliopee is what you are looking for. Create CAHC instance and you should
be able to run hive functions against
Hi ,
I have 2 questions -
1. I was trying to use the Resource Manager UI for my Spark application in yarn-cluster mode, as I observed that the Spark UI does not work for yarn-cluster.
Is that correct or am I missing some setup?
This is more of a java/scala question than spark - it uses java.util.Random :
https://github.com/scala/scala/blob/2.11.x/src/library/scala/util/Random.scala
On 3 Mar 2015, at 15:08, Vijayasarathy Kannan kvi...@vt.edu wrote:
Hi,
What pseudo-random-number generator does scala.util.Random
What about the following which can be run in spark shell:
import org.apache.spark._
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD
val vertexlist = Array((1L, "One"), (2L, "Two"), (3L, "Three"),
  (4L, "Four"), (5L, "Five"), (6L, "Six"))
val edgelist = Array(Edge(6, 5, "6 to 5"), Edge(5, 4, "5 to
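Since the snippet got cut off, here is a fuller sketch along the same lines, runnable in the spark shell. The reachability computation with Pregel is only my illustration of the EdgeDirection.In suggestion; the edge attributes and sets are made up:

import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

val vertexlist: RDD[(VertexId, String)] = sc.parallelize(Seq(
  (1L, "One"), (2L, "Two"), (3L, "Three"), (4L, "Four"), (5L, "Five"), (6L, "Six")))
val edgelist: RDD[Edge[String]] = sc.parallelize(Seq(
  Edge(1L, 2L, "1 to 2"), Edge(2L, 3L, "2 to 3"), Edge(3L, 4L, "3 to 4"),
  Edge(4L, 5L, "4 to 5"), Edge(5L, 6L, "5 to 6"), Edge(6L, 6L, "6 to 6")))
val graph = Graph(vertexlist, edgelist)

// Each vertex accumulates the set of vertices reachable from it by propagating
// sets backwards along edges (dst -> src); EdgeDirection.In keeps sendMsg running
// only on the in-edges of vertices that changed in the previous round.
val reachable = graph
  .mapVertices((_, _) => Set.empty[VertexId])
  .pregel(Set.empty[VertexId], activeDirection = EdgeDirection.In)(
    (_, attr, msg) => attr ++ msg,
    triplet => {
      val candidate = triplet.dstAttr + triplet.dstId
      if (candidate.subsetOf(triplet.srcAttr)) Iterator.empty
      else Iterator((triplet.srcId, candidate))
    },
    (a, b) => a ++ b)

reachable.vertices.collect().sortBy(_._1).foreach { case (id, set) =>
  println(s"Vertex $id path: ${(set - id).toList.sorted.mkString(", ")}")
}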
@Cheng: My problem is that the connector I use to query Spark does not support the latest Hive (0.12, 0.13), but I need to perform Hive queries on data retrieved from Cassandra. I assumed that if I get the data out of Cassandra in some way and register it as a temp table, I would be able to query it using
Thanks Rohit,
I am already using Calliope and quite happy with it, well done! Except for the fact that:
1- It seems that it does not support Hive 0.12 or higher, am I right? For example you cannot use the current_time() UDF, or those new UDFs added in Hive 0.12. Are they supported? Any plan for
thanks, it works.
2015-03-03 13:32 GMT+01:00 Cheng, Hao hao.ch...@intel.com:
Use where('age >= 10 || 'age <= 4) instead.
-Original Message-
From: Guillermo Ortiz [mailto:konstt2...@gmail.com]
Sent: Tuesday, March 3, 2015 5:14 PM
To: user
Subject: SparkSQL, executing an OR
I'm trying
The Scala syntax for arrays is Array[T], not T[], so you want to use something like:
kryo.register(classOf[Array[org.roaringbitmap.RoaringArray$Element]])
kryo.register(classOf[Array[Short]])
Nonetheless, Spark should take care of this itself. I'll look into it later today.
On Mon, Mar 2, 2015
As the call stack shows, the MongoDB connector is not compatible with the Spark SQL Data Source interface. The Data Source API changed in 1.2; you probably need to confirm which Spark version the MongoDB connector was built against.
By the way, a well-formatted call stack would be more
You are right, CassandraAwareSQLContext is a subclass of SQLContext.
But I did another experiment: I queried Cassandra using CassandraAwareSQLContext, then I registered the RDD as a temp table; next I tried to query it using HiveContext, but it seems that the Hive context cannot see the registered
Whenever I set spark.local.dir to multiple disks, the job fails; the errors are as follows
(if I set spark.local.dir to only 1 dir, the job succeeds...):
Exception in thread "main" org.apache.spark.SparkException: Job cancelled
because SparkContext was shut down
at
I did an experiment with the Hive and SQL contexts: I queried Cassandra using CassandraAwareSQLContext (a custom SQL context from Calliope), then I registered the RDD as a temp table; next I tried to query it using HiveContext, but it seems that the Hive context cannot see the registered table using
Sorry for the half email - here it is again in full.
Hi ,
I have 2 questions -
1. I was trying to use the Resource Manager UI for my Spark application in yarn-cluster mode, as I observed that the Spark UI does not work for yarn-cluster.
Is that correct or am I missing some setup?
2. When I click on
Dear all,
Is there a least squares solver based on DistributedMatrix that we can use out of the box in the current (or the master) version of Spark?
It seems that the only least squares solver available in Spark is private to the recommender package.
Cheers,
Jao
The Hive dependency comes from spark-hive.
It does work with Spark 1.1; we will have the 1.2 release later this month.
On Mar 3, 2015 8:49 AM, shahab shahab.mok...@gmail.com wrote:
Thanks Rohit,
I am already using Calliope and quite happy with it, well done ! except
the fact that :
1- It
Hello Ted, some progress: now it seems that the Spark job does get submitted; in the Spark web UI, I do see it under finished drivers. However, it seems to not get past this step: JavaPairReceiverInputDStream<String, String>
messages = KafkaUtils.createStream(jsc, "localhost:2181", "aa",
bq. spark UI does not work for Yarn-cluster.
Can you be a bit more specific on the error(s) you saw ?
What Spark release are you using ?
Cheers
On Tue, Mar 3, 2015 at 8:53 AM, Rohini joshi roni.epi...@gmail.com wrote:
Sorry for the half email - here it is again in full.
Hi ,
I have 2
Hi,
I got this error message:
15/03/03 10:22:41 ERROR OneForOneBlockFetcher: Failed while starting block
fetches
java.lang.RuntimeException: java.io.FileNotFoundException:
Hi Srini,
If you start $SPARK_HOME/sbin/start-history-server.sh, you should be able to see the basic Spark UI. You will not see the master, but you will be able to see the rest, as I recall. You also need to add an entry to spark-defaults.conf, something like this:
## Make sure the host
bq. changing the address with internal to the external one , but still does
not work.
Not sure what happened.
For the time being, you can use yarn command line to pull container log
(put in your appId and container Id):
yarn logs -applicationId application_1386639398517_0007 -containerId
Hi,
I have a spark streaming application, running on a single node, consisting
mainly of map operations. I perform repartitioning to control the number of
CPU cores that I want to use. The code goes like this:
val ssc = new StreamingContext(sparkConf, Seconds(5))
val distFile =
Hi Ted,
I used s3://support.elasticmapreduce/spark/install-spark to install spark
on my EMR cluster. It is 1.2.0.
When I click on the link for history or logs it takes me to
http://ip-172-31-43-116.us-west-2.compute.internal:9035/node/containerlogs/container_1424105590052_0070_01_01/hadoop
Failed to connect implies that the executor at that host died, please
check its logs as well.
On Tue, Mar 3, 2015 at 11:03 AM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
Sorry that I forgot the subject.
And in the driver, I got many FetchFailedException. The error messages are
15/03/03
Spark applications shown in the RM's UI should have an Application
Master link when they're running. That takes you to the Spark UI for
that application where you can see all the information you're looking
for.
If you're running a history server and add
spark.yarn.historyServer.address to your
If you can use hadoop 2.6.0 binary, you can use s3a
s3a is being polished in the upcoming 2.7.0 release:
https://issues.apache.org/jira/browse/HADOOP-11571
Cheers
On Tue, Mar 3, 2015 at 9:44 AM, Ankur Srivastava ankur.srivast...@gmail.com
wrote:
Hi,
We recently upgraded to Spark 1.2.1 -
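For completeness, a rough sketch of what the switch to s3a looks like - the keys, bucket and path are placeholders, and it assumes the hadoop-aws / s3a classes are on the classpath:

sc.hadoopConfiguration.set("fs.s3a.access.key", "<your access key>")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "<your secret key>")
val data = sc.textFile("s3a://my-bucket/path/to/data")   // illustrative bucket/path
data.count()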
Thanks a lot Ted!!
On Tue, Mar 3, 2015 at 9:53 AM, Ted Yu yuzhih...@gmail.com wrote:
If you can use hadoop 2.6.0 binary, you can use s3a
s3a is being polished in the upcoming 2.7.0 release:
https://issues.apache.org/jira/browse/HADOOP-11571
Cheers
On Tue, Mar 3, 2015 at 9:44 AM, Ankur
There are a couple of solvers that I've written that are part of the AMPLab ml-matrix repo [1,2]. These aren't part of MLlib yet though, and if you are interested in porting them I'd be happy to review it.
Thanks
Shivaram
[1]
Sorry that I forgot the subject.
And in the driver, I got many FetchFailedException. The error messages are
15/03/03 10:34:32 WARN TaskSetManager: Lost task 31.0 in stage 2.2 (TID
7943, ): FetchFailed(BlockManagerId(86, , 43070), shuffleId=0,
mapId=24, reduceId=1220, message=
@Yin: sorry for my mistake, you are right it was added in 1.2, not 0.12.0 ,
my bad!
On Tue, Mar 3, 2015 at 6:47 PM, shahab shahab.mok...@gmail.com wrote:
Thanks Rohit, yes my mistake, it does work with 1.1 ( I am actually
running it on spark 1.1)
But do you mean that even HiveConext of
Hi Folks,
We are trying to run the following code from the spark shell in a CDH 5.3
cluster running on RHEL 5.8.
spark-shell --master yarn --deploy-mode client --num-executors 15 --executor-cores 6 --executor-memory 12G
import org.apache.spark.mllib.recommendation.ALS
import
In Spark 1.2 you'll have to create a partitioned hive table
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AddPartitions
in order to read parquet data in this format. In Spark 1.3 the parquet
data source will auto discover partitions when they are laid out
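A rough sketch of that Spark 1.2 workaround - the table name, columns and paths below are made up for illustration:

val hc = new org.apache.spark.sql.hive.HiveContext(sc)

// Declare a partitioned external table over the existing parquet layout...
hc.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS events (id BIGINT, value STRING)
  PARTITIONED BY (dt STRING)
  STORED AS PARQUET
  LOCATION 'hdfs:///data/events'""")

// ...then register each partition directory explicitly.
hc.sql("ALTER TABLE events ADD IF NOT EXISTS PARTITION (dt='2015-03-03') " +
  "LOCATION 'hdfs:///data/events/dt=2015-03-03'")

hc.sql("SELECT count(*) FROM events WHERE dt = '2015-03-03'").collect()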
Is that error actually occurring in LBFGS? It looks like it might be
happening before the data even gets to LBFGS. (Perhaps the outer join
you're trying to do is making the dataset size explode a bit.) Are you
able to call count() (or any RDD action) on the data before you pass it to
LBFGS?
On
I believe that this has been optimized
https://github.com/apache/spark/commit/2a36292534a1e9f7a501e88f69bfc3a09fb62cb3
in Spark 1.3.
On Tue, Mar 3, 2015 at 4:36 AM, matthes matthias.diekst...@web.de wrote:
I use LATERAL VIEW explode(...) to read data from a parquet-file but the
full schema is
Sorry I made a mistake. Please ignore my question.
On Tue, Mar 3, 2015 at 2:47 AM, Saiph Kappa saiph.ka...@gmail.com wrote:
I performed repartitioning and everything went fine with respect to the
number of CPU cores being used (and respective times). However, I noticed
something very strange:
Ted,
If the application is running then the logs are not available. Plus, what I want to view is the details about the running app, as in the Spark UI.
Do I have to open some ports or make some other setting changes?
On Tue, Mar 3, 2015 at 10:08 AM, Ted Yu yuzhih...@gmail.com wrote:
bq. changing the
The minimization problem you're describing in the email title also looks
like it could be solved using the RidgeRegression solver in MLlib, once you
transform your DistributedMatrix into an RDD[LabeledPoint].
On Tue, Mar 3, 2015 at 11:02 AM, Shivaram Venkataraman
shiva...@eecs.berkeley.edu
You can use DStream.transform() to do any arbitrary RDD transformations on
the RDDs generated by a DStream.
val coalescedDStream = myDStream.transform { _.coalesce(...) }
On Tue, Mar 3, 2015 at 1:47 PM, Saiph Kappa saiph.ka...@gmail.com wrote:
Sorry I made a mistake in my code. Please ignore
Will this recognize the Hive partitions as well?
For example, inserting into a specific partition of the Hive table?
On Tue, Mar 3, 2015 at 11:42 PM, Cheng, Hao hao.ch...@intel.com wrote:
Using the SchemaRDD / DataFrame API via HiveContext
Assume you're using the latest code, something probably like:
val hc
As it says in the API docs
https://spark.apache.org/docs/1.2.0/api/scala/index.html#org.apache.spark.sql.SchemaRDD,
tables created with registerTempTable are local to the context that creates
them:
... The lifetime of this temporary table is tied to the SQLContext
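In other words, registration and querying have to go through the same context instance. A minimal sketch of the point, Spark 1.2 style, with an illustrative case class:

val hc = new org.apache.spark.sql.hive.HiveContext(sc)
import hc.createSchemaRDD              // implicit conversion for case-class RDDs in 1.2

case class Person(name: String, age: Int)
val people = sc.parallelize(Seq(Person("a", 12), Person("b", 30)))

people.registerTempTable("people")                               // registered on hc
hc.sql("SELECT name FROM people WHERE age < 20").collect()       // works: same context
// new org.apache.spark.sql.SQLContext(sc).sql("SELECT * FROM people")  // fails: different context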
Hi all,
I have been trying to set up a stream using a custom receiver that would pick up data from Twitter, using the follow function to listen to just some users. I'd like to keep that stream context running and dynamically change the custom receiver by adding ids of users that I'd listen to.
I see. I think your best bet is to create the cnnModel on the master and
then serialize it to send to the workers. If it's big (1M or so), then you
can broadcast it and use the broadcast variable in the UDF. There is not a
great way to do something equivalent to mapPartitions with UDFs right
Do you have enough resource in your cluster? You can check your resource
manager to see the usage.
Thanks.
Zhan Zhang
On Mar 3, 2015, at 8:51 AM, abhi
abhishek...@gmail.com wrote:
I am trying to run below java class with yarn cluster, but it hangs in accepted
I did some experiments and it seems not. But I'd like to get confirmation (or perhaps I missed something). If it does support it, could you let me know how to specify multiple folders? Thanks.
Senqiang
Also try 1.3.0-RC1 or the current master. ALS should perform much better in 1.3. -Xiangrui
On Tue, Mar 3, 2015 at 1:00 AM, Akhil Das ak...@sigmoidanalytics.com wrote:
You need to increase the parallelism / repartition the data to a higher number to get rid of those.
Thanks
Best Regards
In YARN (cluster or client), you can access the Spark UI while the app is running. After the app is done, you can still access it, but it needs some extra setup for the history server.
Thanks.
Zhan Zhang
On Mar 3, 2015, at 10:08 AM, Ted Yu
yuzhih...@gmail.com wrote:
bq.
Sorry I made a mistake in my code. Please ignore my question number 2.
Different numbers of partitions give *the same* results!
On Tue, Mar 3, 2015 at 7:32 PM, Saiph Kappa saiph.ka...@gmail.com wrote:
Hi,
I have a spark streaming application, running on a single node, consisting
mainly of
Ah!! I think I know what you mean. My job was just in the ACCEPTED state for a long time as it was running a huge file.
But now that it is in the RUNNING state, I can see it. I can see it at port 9046 though, instead of 4040. But I can see it.
Thanks
-roni
On Tue, Mar 3, 2015 at 1:19 PM, Zhan Zhang
Hello
I started using Spark. I am working with Word2VecModel. However I am not able to save the trained model. Here is what I am doing:
inp = sc.textFile("/Users/mediratta/code/word2vec/trunk-d/sub-5").map(lambda row: row.split(" "))
word2vec = Word2Vec()
model = word2vec.fit(inp)
out =
Can you elaborate on what is this switchover time?
TD
On Tue, Mar 3, 2015 at 9:57 PM, Nastooh Avessta (navesta) nave...@cisco.com
wrote:
Hi
On a standalone Spark 1.0.0 cluster, with 1 master and 2 workers, operating in client mode, running a UDP streaming application, I am noting around 2
We have installed a Hadoop cluster with Hive and Spark, and the Spark SQL thrift server is up and running without any problem.
Now we have a set of applications that need to use the Spark SQL thrift server to query some data.
Some of these applications are Java applications and the others are PHP
libgfortran.x86_64 4.1.2-52.el5_8.1 comes with libgfortran.so.1 but
not libgfortran.so.3. JBLAS requires the latter. If you have root
access, you can try to install a newer version of libgfortran.
Otherwise, maybe you can try Spark 1.3, which doesn't use JBLAS in
ALS. -Xiangrui
On Tue, Mar 3,
Thanks Michael. I understand now.
best,
/Shahab
On Tue, Mar 3, 2015 at 9:38 PM, Michael Armbrust mich...@databricks.com
wrote:
As it says in the API docs
https://spark.apache.org/docs/1.2.0/api/scala/index.html#org.apache.spark.sql.SchemaRDD,
tables created with registerTempTable are local
Hmm... ok, previous errors are still block fetch errors.
15/03/03 10:22:40 ERROR RetryingBlockFetcher: Exception while beginning
fetch of 11 outstanding blocks
java.io.IOException: Failed to connect to host-/:55597
at
Hi, When I submit my spark job, I see the following runtime exception in the
log,
Exception in thread Thread-1 java.lang.NoClassDefFoundError:
org/apache/spark/streaming/kafka/KafkaUtils
at SparkHdfs.run(SparkHdfs.java:56)
Caused by: java.lang.ClassNotFoundException:
Dear all,
I found that the sample code below prints its output only in the spark shell; when I moved it into my Spark Streaming application, nothing is printed to the system console. Can you explain why this happens? Is it anything related to the new Spark context? Thanks a lot!
val anotherPeopleRDD
Hi,
can you explain how you copied that into your *streaming* application?
Like, how do you issue the SQL, what data do you operate on, how do you
view the logs etc.?
Tobias
On Wed, Mar 4, 2015 at 8:55 AM, Cui Lin cui@hds.com wrote:
Dear all,
I found the below sample code can be
I am confused. Are you killing the 1st worker node to see whether the
system restarts the receiver on the second worker?
TD
On Tue, Mar 3, 2015 at 10:49 PM, Nastooh Avessta (navesta)
nave...@cisco.com wrote:
This is the time that it takes for the driver to start receiving data
once again,
I think it would be interesting to have an equivalent of mapPartitions for DataFrames. There are many use cases where data is processed in batches. Another example is a simple linear classifier Ax = b, where A is the matrix of feature vectors, x the model and b the output. Here again the product
Spark SQL supports JDBC/ODBC connectivity, so if that's the route you need/want to connect through, you could do so via Java/PHP apps. I haven't used either so I can't speak to the developer experience; I assume it's pretty good, as it would be the preferred method for lots of third-party enterprise
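A rough sketch of what the JDBC route looks like from the Java/Scala side - the host, port, user and table are placeholders, and it assumes the HiveServer2 JDBC driver (and its dependencies) are on the classpath:

import java.sql.DriverManager

Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection("jdbc:hive2://thrift-server-host:10000/default", "user", "")
val stmt = conn.createStatement()
val rs = stmt.executeQuery("SELECT count(*) FROM some_table")
while (rs.next()) println(rs.getLong(1))
rs.close(); stmt.close(); conn.close()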
Not quite sure, but you can try increasing spark.akka.threads; most likely it is a YARN-related issue.
Thanks
Best Regards
On Tue, Mar 3, 2015 at 3:38 PM, twinkle sachdeva twinkle.sachd...@gmail.com
wrote:
Hi,
Operations are not very extensive, as this scenario is not always
This is the time that it takes for the driver to start receiving data once again, from the 2nd worker, when the 1st worker, where the streaming thread was initially running, is shut down.
Cheers,
Nastooh Avessta
ENGINEER.SOFTWARE
Thanks Ted. Actually, a follow-up question: I need to read multiple HDFS files into an RDD. What I am doing now is: for each file, I read it into an RDD, then later I union all these RDDs into one RDD. I am not sure if this is the best way to do it.
Thanks, Senqiang
On Tuesday, March 3, 2015
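For what it's worth, two alternatives sketched below (the paths are illustrative): textFile takes comma-separated paths and globs, and SparkContext.union takes a whole sequence of RDDs at once.

// One RDD straight from several paths / globs:
val combined = sc.textFile("hdfs:///logs/2015-03-01/*,hdfs:///logs/2015-03-02/*")

// Or, if you already built per-file RDDs:
// val all = sc.union(Seq(rdd1, rdd2, rdd3))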
Looking at FileInputFormat#listStatus():
// Whether we need to recursive look into the directory structure
boolean recursive = job.getBoolean(INPUT_DIR_RECURSIVE, false);
where:
public static final String INPUT_DIR_RECURSIVE =
  "mapreduce.input.fileinputformat.input.dir.recursive";
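So, as a rough sketch, setting that flag on the context's Hadoop configuration before reading should make textFile descend into sub-directories (the path is illustrative):

sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive", "true")
val lines = sc.textFile("hdfs:///data/root-dir")
lines.count()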
Thanks for the confirmation, Stephen.
On Tue, Mar 3, 2015 at 3:53 PM, Stephen Boesch java...@gmail.com wrote:
Thanks, I was looking at an old version of FileInputFormat..
BEFORE setting the recursive config (
mapreduce.input.fileinputformat.input.dir.recursive)
scala
Weird python errors like this generally mean you have different
versions of python in the nodes of your cluster. Can you check that?
On Tue, Mar 3, 2015 at 4:21 PM, subscripti...@prismalytics.io
subscripti...@prismalytics.io wrote:
Hi Friends:
We noticed the following in 'pyspark' happens when
Hi,
On Wed, Mar 4, 2015 at 6:20 AM, Zhan Zhang zzh...@hortonworks.com wrote:
Do you have enough resource in your cluster? You can check your resource
manager to see the usage.
Yep, I can confirm that this is a very annoying issue. If there is not
enough memory or VCPUs available, your app
Thanks, I was looking at an old version of FileInputFormat.
BEFORE setting the recursive config (mapreduce.input.fileinputformat.input.dir.recursive):
scala> sc.textFile("dev/*").count
java.io.IOException: *Not a file*:
file:/shared/sparkup/dev/audit-release/blank_maven_build
The default is
Hi,
I am trying to run a simple select query on a table:
val restaurants = hiveCtx.hql("select * from TableName where column like '%SomeString%'")
This gives an error as below:
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved
attributes: *, tree:
How do I solve this?
The sc.textFile() invokes the Hadoop FileInputFormat via the (subclass)
TextInputFormat. Inside the logic does exist to do the recursive directory
reading - i.e. first detecting if an entry were a directory and if so then
descending:
for (FileStatus
I would recommend caching; if you can't persist, iterative algorithms will
not work well.
I don't think calling count on the dataset is problematic; every iteration
in LBFGS iterates over the whole dataset and does a lot more computation
than count().
It would be helpful to see some error
Hi Friends:
We noticed the following in 'pyspark' happens when running in
distributed Standalone Mode (MASTER=spark://vps00:7077),
but not in Local Mode (MASTER=local[n]).
See the following, particularly what is highlighted in *Red* (again the
problem only happens in Standalone Mode).
Any
Which version / distribution are you using? Please reference this blog post that Felix C posted if you're running on CDH:
http://eradiating.wordpress.com/2015/02/22/getting-hivecontext-to-work-in-cdh/
Or you may also need to download the datanucleus*.jar files and try to add them with the "--jars" option.
Yeah, I can call count before that and it works. Also, I was over-caching tables, but I removed those. Now there is no caching, but it gets really slow since it calculates my table RDD many times.
I also hacked the LBFGS code to pass the number of examples, which I calculated outside in a Spark SQL
Looking at scaladoc:
/** Get an RDD for a Hadoop file with an arbitrary new API InputFormat. */
def newAPIHadoopFile[K, V, F <: NewInputFormat[K, V]]
Your conclusion is confirmed.
On Tue, Mar 3, 2015 at 1:59 PM, S. Zhou myx...@yahoo.com.invalid wrote:
I did some experiments and it seems