Hi,
I am using newAPIHadoopFile to process a large number of S3 files (around 20
thousand) by passing the URLs as a comma-separated string. It takes around *7
minutes* to start the job. I am running the job on EMR 5.2.0 with Spark
2.0.2.
Here is the code
Configuration conf = new Configuration();
Would it be more robust to use the Path when creating the FileSystem?
https://github.com/graphframes/graphframes/issues/160
On Thu, Jan 5, 2017 at 4:57 PM, Felix Cheung
wrote:
This is likely a factor of your hadoop config and Spark rather than anything
specific to GraphFrames.
You might have better luck getting assistance if you could isolate the code to
a simple case that manifests the problem (without GraphFrames), and repost.
Adding the DEV mailing list to see if this is a defect with ConnectedComponent
or if they can recommend any solution.
Thanks
Ankur
On Thu, Jan 5, 2017 at 1:10 PM, Ankur Srivastava wrote:
Yes, I did try it out, and it chooses the local file system even though my
checkpoint location starts with s3n://.
I am not sure how I can make it load the S3 file system.
On Thu, Jan 5, 2017 at 12:12 PM, Felix Cheung
wrote:
Hello Experts,
I am trying to allow null values in numeric fields. Here are the details of
the issue I have:
http://stackoverflow.com/questions/41492344/spark-avro-to-parquet-writing-null-values-in-number-fields
I also tried making all columns nullable by using the below function (from
one of
Right, I'd agree, it seems to be only with delete.
Could you by chance run just the delete to see if it fails?
FileSystem.get(sc.hadoopConfiguration)
.delete(new Path(somepath), true)
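As background for why the delete can pick the wrong file system: FileSystem.get(conf) resolves only the configured *default* file system, whereas Hadoop's Path-based lookup (new Path(somepath).getFileSystem(conf)) dispatches on the URL scheme. A toy pure-Python model of that resolution, with illustrative names only (this is not Hadoop's code):

```python
from urllib.parse import urlparse

# Toy registry mapping URL schemes to file-system implementations.
# In real Hadoop this is driven by fs.<scheme>.impl configuration keys.
REGISTRY = {"file": "LocalFileSystem", "s3n": "NativeS3FileSystem"}

def get_default_fs(conf):
    # Mirrors FileSystem.get(conf): ignores the path entirely and
    # returns whatever fs.defaultFS points at.
    return REGISTRY[urlparse(conf["fs.defaultFS"]).scheme]

def get_fs_for_path(path, conf):
    # Mirrors path.getFileSystem(conf): dispatches on the path's scheme,
    # falling back to the default only for scheme-less paths.
    scheme = urlparse(path).scheme
    return REGISTRY[scheme] if scheme else get_default_fs(conf)

conf = {"fs.defaultFS": "file:///"}
print(get_default_fs(conf))                       # LocalFileSystem
print(get_fs_for_path("s3n://bucket/chk", conf))  # NativeS3FileSystem
```

With the real API, a Path-based lookup would hand back the s3n implementation for an s3n:// checkpoint path even when the default file system is local.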
From: Ankur Srivastava
Sent: Thursday, January 5,
Hi Steve,
Thanks for the reply; below is the follow-up help needed from you.
Do you mean we can set up two native file systems on a single SparkContext, so
that based on the URL prefix (gs://bucket/path and dest s3a://bucket-on-s3/path2)
it will identify and read/write against the appropriate cloud?
Is that my
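On the two-clouds question: the Hadoop FileSystem layer dispatches on the URL scheme, so one SparkContext can indeed read gs:// and write s3a:// provided both connectors are on the classpath and configured. A hedged configuration sketch — the jar name, key-file path, and the literal `...` credential values are placeholders, and the property names should be checked against the GCS connector docs for your Hadoop version:

```shell
# Hypothetical sketch; jar name/version and paths are placeholders.
spark-submit \
  --jars gcs-connector-hadoop2-latest.jar \
  --conf spark.hadoop.fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem \
  --conf spark.hadoop.google.cloud.auth.service.account.json.keyfile=/path/to/key.json \
  --conf spark.hadoop.fs.s3a.access.key=... \
  --conf spark.hadoop.fs.s3a.secret.key=... \
  my-job.jar
```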
So, it seems the only way I found for now is a recursive handling of the Row
instances directly, but to do that I have to go back to RDDs. I've put together
a simple test case demonstrating the problem:
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.scalatest.{FlatSpec,
Hi
might be off topic, but Databricks has a web application in which you
can use Spark with Jupyter. Have a look at
https://community.cloud.databricks.com
kr
On Thu, Jan 5, 2017 at 7:53 PM, Jon G wrote:
I don't use MapR but I use pyspark with jupyter, and this MapR blogpost
looks similar to what I do to setup:
https://community.mapr.com/docs/DOC-1874-how-to-use-jupyter-pyspark-on-mapr
On Thu, Jan 5, 2017 at 3:05 AM, neil90 wrote:
There is a way: you can use
org.apache.spark.sql.functions.monotonicallyIncreasingId; it will give each row
of your dataframe a unique ID.
On Tue, Oct 18, 2016 10:36 AM, ayan guha guha.a...@gmail.com
wrote:
Do you have any primary key or unique identifier in your data? Even if multiple
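For context on why monotonicallyIncreasingId is unique: it packs the partition index into the upper bits of a 64-bit ID and a per-partition row counter into the lower 33 bits, so the IDs are increasing and unique but not consecutive. A pure-Python model of that scheme (an illustration, not Spark's code):

```python
def monotonic_ids(partitions):
    """Toy model of monotonically_increasing_id: partition index in the
    upper bits, row offset within the partition in the lower 33 bits."""
    ids = []
    for pid, rows in enumerate(partitions):
        for offset, _ in enumerate(rows):
            ids.append((pid << 33) | offset)
    return ids

# Two partitions: note the jump between partition 0 and partition 1.
ids = monotonic_ids([["a", "b"], ["c"]])
print(ids)  # [0, 1, 8589934592]
```

The jump between partitions is why these IDs cannot be used as consecutive row numbers.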
This blog post(Not mine) has some nice examples -
https://hadoopist.wordpress.com/2016/08/19/how-to-create-compressed-output-files-in-spark-2-0/
From the blog -
df.write.mode("overwrite").format("parquet").option("compression",
"none").mode("overwrite").save("/tmp/file_no_compression_parq")
Yes, it works to read the vertices and edges data from the S3 location, and it
is also able to write the checkpoint files to S3. It only fails when deleting
the data, and that is because it tries to use the default file system. I
tried looking up how to update the default file system but could not find
Hi User Team,
I'm trying to schedule resources in Spark 2.1.0 using the code below, but all
the CPU cores are still captured by a single Spark application, so no other
application can start. Could you please help me out:
sqlContext =
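A likely cause is that, without a cap, the first application grabs every core the cluster manager offers. How to bound it depends on the cluster manager; a hedged sketch (the values are illustrative, not recommendations):

```shell
# Standalone mode: cap the total cores one application may hold
spark-submit --conf spark.cores.max=4 app.py

# YARN: bound the footprint explicitly (or enable dynamic allocation)
spark-submit --num-executors 4 --executor-cores 2 app.py
```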
Why not do that with Spark SQL to utilise the executors properly, rather than
a sequential filter on the driver?
SELECT * FROM A LEFT JOIN B ON A.fk = B.fk WHERE B.pk IS NULL LIMIT k
If you were sorting just so you could iterate in order, this might save you a
couple of sorts too.
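The query above is a left anti-join: keep the rows of A that have no match in B, capped at k. A pure-Python sketch of the same semantics (not Spark code; the names are illustrative):

```python
def anti_join_limit(a_rows, b_keys, key, k):
    """Rows of A whose key has no match in B, capped at k -- the same
    result as: SELECT * FROM A LEFT JOIN B ON A.fk = B.fk
               WHERE B.pk IS NULL LIMIT k"""
    b_set = set(b_keys)  # build side, as in a broadcast hash join
    out = []
    for row in a_rows:
        if row[key] not in b_set:
            out.append(row)
            if len(out) == k:
                break
    return out

a = [{"fk": 1}, {"fk": 2}, {"fk": 3}, {"fk": 4}]
print(anti_join_limit(a, [2, 4], "fk", 2))  # [{'fk': 1}, {'fk': 3}]
```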
Hi all,
I am aware that collect will return a list aggregated on the driver; this will
cause an OOM when the list is too big.
Is toLocalIterator safe to use with a very big list? I want to access all
values one by one.
Basically the goal is to compare two sorted rdds (A and B) to find top k
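toLocalIterator fetches one partition at a time, so the driver only needs to hold the largest partition rather than the whole dataset. And since both RDDs are sorted, they can be compared with a lockstep walk over two local iterators; a pure-Python sketch of that walk (illustrative, not Spark code):

```python
def top_k_missing(a_iter, b_iter, k):
    """Walk two sorted iterators in lockstep and return the first k
    values present in a_iter but absent from b_iter, without ever
    materializing either sequence (mimicking two toLocalIterator calls)."""
    out = []
    b_iter = iter(b_iter)
    b = next(b_iter, None)
    for a in a_iter:
        # Advance B until it catches up with the current A value.
        while b is not None and b < a:
            b = next(b_iter, None)
        if b is None or b > a:  # a never appears in B
            out.append(a)
            if len(out) == k:
                break
    return out

print(top_k_missing(iter([1, 2, 3, 5, 8]), iter([2, 3, 4]), 2))  # [1, 5]
```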
Hi Marco,
Yes, it was on the same host when the problem was found.
Even when I tried to start on a different host, the problem was still there.
Any hints or suggestion will be appreciated.
Thanks & Best Regards,
Palash Gupta
From: Marco Mistroni
To: Palash Gupta
Hi All,
Using Spark, is interoperability communication between two
clouds (Google, AWS) possible?
In my use case I need to take Google storage as input to Spark, do some
processing, and finally store the output in S3; my Spark engine runs on an AWS
cluster.
Please let me know if there is any way for
Hi
If it only happens when you run 2 apps at the same time, could it be that
these 2 apps somehow run on the same host?
Kr
On 5 Jan 2017 9:00 am, "Palash Gupta" wrote:
Hi Marco and respected member,
I have done all the possible things suggested by the forum but still I'm
having the same issue:
1. I will migrate my applications to production environment where I will have
more resources. Palash>> I migrated my application to production where I have
more CPU cores,
From the stack it looks to be an error from the explicit call to
hadoop.fs.FileSystem.
Is the URL scheme for s3n registered?
Does it work when you try to read from s3 from Spark?
From: Ankur Srivastava
Hi Team,
Can someone please share any examples of Spark Java reading and writing files
from Google Storage.
Thank you in advance.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-java-with-Google-Store-tp28276.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Can you be more specific on what you would want to change on the DF level?
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Setting-Spark-Properties-on-Dataframes-tp28266p28275.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Assuming you don't have your environment variables set up in your
.bash_profile, you would do it like this -
import os
import sys

# Point at your local Spark installation
spark_home = '/usr/local/spark'
# Make pyspark and its bundled py4j importable before importing pyspark
sys.path.insert(0, spark_home + "/python")
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.10.1-src.zip'))
Hi ,
I am using an EMR machine and I can see that the Spark log directory has grown
to 4G.
File name: spark-history-server.out
Need advice on how I can reduce the size of the above-mentioned file.
Is there a config property which can help me?
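One approach, assuming the messages in that file are emitted through log4j (worth verifying for your EMR release, since a daemon's .out file can also be a plain stdout redirect): switch the history server's logging to a size-bounded rolling appender. A hedged sketch for /etc/spark/conf/log4j.properties, with paths and size limits as placeholders:

```properties
# Hypothetical sketch -- verify the appender setup your EMR release uses.
log4j.rootCategory=INFO, rolling
log4j.appender.rolling=org.apache.log4j.RollingFileAppender
log4j.appender.rolling.File=/var/log/spark/spark-history-server.log
log4j.appender.rolling.MaxFileSize=100MB
log4j.appender.rolling.MaxBackupIndex=5
log4j.appender.rolling.layout=org.apache.log4j.PatternLayout
log4j.appender.rolling.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
```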
Thanks,
Divya