Uh. Many thanks. Will try it out.
On 17 Jan 2017 6:47 am, "Palash Gupta" wrote:
> Hi Marco,
>
> What is the user and password you are using for mongodb connection? Did
> you enable authorization?
>
> Better to include user & pass in mongo url.
>
> I remember I tested
Hi,
Example:
dframe = sqlContext.read.format("com.mongodb.spark.sql.DefaultSource")
  .option("spark.mongodb.input.uri",
    "mongodb://user:pass@172.26.7.192:27017/db_name.collection_name")
  .load()
dframe.printSchema()
One more thing: if you create a db in Mongo, please create a collection with a
Hi Marco,
What is the user and password you are using for the MongoDB connection? Did you
enable authorization?
Better to include the user & pass in the mongo URL.
I remember I tested with python successfully.
Best Regards, Palash
Sent from Yahoo Mail on Android
On Tue, 17 Jan, 2017 at 5:37 am, Marco
Hi,
I am trying to group data in Spark and find the maximum value for each group.
I have to use group by as I need to transpose based on the values.
I tried repartitioning the data by increasing the number from 1 to 1. The job
runs up to the stage below and then takes a long time to move ahead. I was
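For what it's worth, a minimal sketch of grouping and taking the maximum per group (the DataFrame and the column names "key"/"value" are made up):

import org.apache.spark.sql.functions.max

// hypothetical DataFrame with a grouping column "key" and a numeric column "value"
val df = spark.range(100).selectExpr("id % 10 as key", "id as value")

// maximum value for each group; the transpose step would follow on this result
val maxPerGroup = df.groupBy("key").agg(max("value").alias("max_value"))
maxPerGroup.show()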
Hello,
I have the following services configured and installed successfully:
Hadoop 2.7.x
Spark 2.0.x
HBase 1.2.4
Hive 1.2.1
*Installation Directories:*
/usr/local/hadoop
/usr/local/spark
/usr/local/hbase
*Hive Environment variables:*
#HIVE VARIABLES START
export HIVE_HOME=/usr/local/hive
Hello Spark Folks,
Another weird experience I have with Spark's SqlContext: when I create a
DataFrame, sometimes this error throws an exception and sometimes it does not!
scala> import sqlContext.implicits._
import sqlContext.implicits._
scala> val stdDf =
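For reference, a minimal sketch of the usual shell setup (the stdDf contents are made up); one common cause of the import failing intermittently is that sqlContext must be a stable val:

// sqlContext must be a stable identifier (a val) for the implicits import to work
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._

// hypothetical data; toDF comes from the implicits above
val stdDf = Seq((1, "alice"), (2, "bob")).toDF("id", "name")
stdDf.printSchema()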
I am experiencing a ScalaReflectionException when doing an aggregation on a
Spark SQL DataFrame. The error looks like this:
Exception in thread "main" scala.ScalaReflectionException: class
in JavaMirror with sun.misc.Launcher$AppClassLoader@28d93b30 of type class
Hi,
Coalescing does not happen automatically; you need to control the number of
partitions yourself.
Basically, #partitions respects the `spark.default.parallelism` setting, which
by default is the number of cores on your machine.
http://spark.apache.org/docs/latest/configuration.html#execution-behavior
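As a small sketch (the numbers are only illustrative), the partition count stays with the parent unless you change it yourself:

val rdd = sc.parallelize(1 to 1000000)   // partitioned according to spark.default.parallelism
println(rdd.getNumPartitions)

// a heavy filter keeps the parent's partition count, so coalesce manually
val small = rdd.filter(_ % 1000 == 0).coalesce(2)
println(small.getNumPartitions)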
// maropu
On
Hi,
It seems you hit this issue:
https://issues.apache.org/jira/browse/SPARK-18020
// maropu
On Tue, Jan 17, 2017 at 11:51 AM, noppanit wrote:
> I'm totally new to Spark and I'm trying to learn from the example. I'm
> following this example
>
>
Hello List,
I was wondering what the design principle is behind an RDD inheriting its
number of partitions from its parent. See one simple example below [*].
'ngauss_rdd2' has significantly less data; intuitively, in such cases,
shouldn't Spark invoke coalesce automatically for performance? What would
I'm totally new to Spark and I'm trying to learn from the example. I'm
following this example
https://github.com/apache/spark/blob/master/external/kinesis-asl/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala.
It works well. But I do have one question. Every time I
Hi all,
I have a test with Spark 2.0.
I have a test file, with fields delimited by \t:
kevin 30 2016
shen 30 2016
kai 33 2016
wei 30 2016
After using:
var datas: RDD[(LongWritable, String)] =
  sc.newAPIHadoopFile(inputPath + filename, classOf[TextInputFormat],
    classOf[LongWritable],
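As an aside, a sketch of two ways to split the fields in Spark 2.0 (inputPath and filename are assumed to be defined as above):

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// low-level: split each line on the tab delimiter yourself
val lines = sc.newAPIHadoopFile(inputPath + filename, classOf[TextInputFormat],
  classOf[LongWritable], classOf[Text])
val fields = lines.map { case (_, text) => text.toString.split("\t") }

// higher-level: let the CSV reader handle the delimiter
val df = spark.read.option("sep", "\t").csv(inputPath + filename)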
Hi Shawn,
Could we do this as below, checking whether any of the column values matches?
scala> val df = spark.range(10).selectExpr("id as a", "id / 2 as b")
df: org.apache.spark.sql.DataFrame = [a: bigint, b: double]
scala> df.filter(_.toSeq.exists(v => v == 1)).show()
+---+---+
| a| b|
+---+---+
| 1|0.5|
| 2|1.0|
Hello,
It seems that metadata is not propagating when using Dataset.map(). Is
there a workaround?
Below are the steps to reproduce:
import spark.implicits._
val columnName = "col1"
val meta = new MetadataBuilder().putString("foo", "bar").build()
val schema =
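One possible workaround, as a sketch under the assumption that re-attaching the metadata after the map is acceptable, is to re-apply it with Column.as(alias, metadata):

import spark.implicits._
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.MetadataBuilder

val meta = new MetadataBuilder().putString("foo", "bar").build()

// hypothetical dataset with a single "col1" column
val ds = Seq(1, 2, 3).toDF("col1").as[Int]

// map drops the metadata, so re-attach it explicitly afterwards
val mapped = ds.map(_ + 1).toDF("col1")
  .select(col("col1").as("col1", meta))

println(mapped.schema("col1").metadata)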
hi all
I have the following snippet, which loads a DataFrame from a CSV file and
tries to save it to MongoDB.
For some reason, the MongoSpark.save method raises the following exception:
Exception in thread "main" java.lang.IllegalArgumentException: Missing
database name. Set via the
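In case it helps, a sketch of the usual fix, assuming mongo-spark-connector 2.0.0 and placeholder host/db/collection names: the connector needs spark.mongodb.output.uri (or an explicit WriteConfig) to know which database to write to.

import org.apache.spark.sql.SparkSession
import com.mongodb.spark.MongoSpark

val spark = SparkSession.builder()
  .appName("CsvToMongo")
  // placeholder URI; it must name both the database and the collection
  .config("spark.mongodb.output.uri",
    "mongodb://127.0.0.1:27017/db_name.collection_name")
  .getOrCreate()

val df = spark.read.option("header", "true").csv("/path/to/file.csv")

// with the output uri set on the session, save knows the target database
MongoSpark.save(df)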
Hi
For my use case, I need to call a third-party function (which is memory based)
on each complete partition's data. So I am partitioning the RDD logically
using repartition on an index column and applying function f via
mapPartitions(f).
When I iterate through the mapPartitions iterator, can I assume
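A sketch of that pattern (the key column, partition count, and thirdPartyCall are all placeholders):

import org.apache.spark.HashPartitioner

// hypothetical in-memory third-party function that needs a whole partition at once
def thirdPartyCall(batch: Seq[(Int, String)]): Seq[String] = batch.map(_._2.toUpperCase)

val rdd = sc.parallelize(Seq((1, "a"), (1, "b"), (2, "c"), (2, "d")))

// co-locate records by the index column, then hand each complete partition to the function
val result = rdd
  .partitionBy(new HashPartitioner(2))
  .mapPartitions { iter =>
    val batch = iter.toSeq          // materialises the whole partition in memory
    thirdPartyCall(batch).iterator
  }
result.collect().foreach(println)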
I need to filter out outliers from a dataframe by all columns. I can
manually list all columns like:
df.filter(x => math.abs(x.get(0).toString().toDouble - means(0)) <= 3 * stddevs(0))
  .filter(x => math.abs(x.get(1).toString().toDouble - means(1)) <= 3 * stddevs(1))
  ...
But I want to turn it into a
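One way to generalize it is to fold the same filter over the column indices; a sketch, assuming means and stddevs are indexed the same way as the columns:

val filtered = df.columns.indices.foldLeft(df) { (acc, i) =>
  acc.filter(row => math.abs(row.get(i).toString.toDouble - means(i)) <= 3 * stddevs(i))
}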
Sorry, I should have done more research before jumping to the list.
The version of the connector is 2.0.0, available from the Maven repos.
Sorry.
On Mon, Jan 16, 2017 at 9:32 PM, Marco Mistroni wrote:
> HI all
> in searching on how to use Spark 2.0 with mongo i came across this
Hi all,
while searching for how to use Spark 2.0 with Mongo I came across this link
https://jira.mongodb.org/browse/SPARK-20
I amended my build.sbt (content below); however, the mongodb dependency was
not found.
Could anyone assist?
kr
marco
name := "SparkExamples"
version := "1.0"
scalaVersion :=
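For reference, a sketch of the dependency block that usually resolves for the 2.0.0 connector (the exact version strings are my assumption):

libraryDependencies ++= Seq(
  "org.apache.spark"  %% "spark-core"            % "2.0.0" % "provided",
  "org.apache.spark"  %% "spark-sql"             % "2.0.0" % "provided",
  "org.mongodb.spark" %% "mongo-spark-connector" % "2.0.0"
)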
On Sun, Jan 15, 2017 at 11:09 AM, Andrew Holway <
andrew.hol...@otternetworks.de> wrote:
> use yarn :)
>
> "spark-submit --master yarn"
>
Doesn't this require first copying out various Hadoop configuration XML
files from the EMR master node to the machine running the spark-submit? Or
is there a
Hi Pradeep,
That is a good idea. My customized RDDs are similar to the NewHadoopRDD. If
we have billions of InputSplits, will that become a performance bottleneck?
That is, will too much data need to be transferred from the master node to the
compute nodes over the network?
Thanks,
Fei
On Mon, Jan 16,
groupByKey() is a wide dependency and will cause a full shuffle. It's advised
not to use this transformation unless your keys are balanced
(well-distributed) and you genuinely need a full shuffle.
Otherwise, what you want is aggregateByKey() or reduceByKey() (depending on
the output). These operations are
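A quick sketch of the difference on made-up word-count style data:

val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))

// groupByKey ships every value for a key across the network before anything is combined
val viaGroup = pairs.groupByKey().mapValues(_.sum)

// reduceByKey combines values map-side first, so far less data is shuffled
val viaReduce = pairs.reduceByKey(_ + _)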
Hello,
I checked the log file on the worker node and don't see any error there.
This is the first time I have been asked to run on such a small cluster. I
feel it's a resources issue, but it would be a great help if somebody could
confirm this or share their experience. Thanks
On Sat, Jan 14, 2017 at 4:01
You may want to pull up the release/1.2 branch and the 1.2.0 tag to build it
yourself in case the packages are not available.
On Jan 15, 2017 2:55 PM, "Md. Rezaul Karim"
wrote:
> Hi Ayan,
>
> Thanks a million.
>
> Regards,
> _
> *Md. Rezaul
Uhm, not a Spark issue. Anyway... I had similar issues with sbt.
The quick solution to get you going is to place your dependency in your lib
folder.
The not-so-quick one is to build the sbt dependency and do an sbt
publish-local, or deploy it locally.
But I consider both approaches hacks.
Hth
On 16 Jan 2017 2:00
Hi,
Does groupByKey have any intelligence associated with it, such that if all the
keys reside in the same partition, it does not do a shuffle?
Or should the user write mapPartitions (with Scala groupBy code)?
Which would be more efficient, and what are the memory considerations?
Thanks
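For comparison, a sketch of the mapPartitions alternative, assuming the data has already been partitioned by key; note that it materialises each partition in memory, which is the main memory consideration:

import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
  .partitionBy(new HashPartitioner(4))    // all values of a key now share a partition

// group locally within each partition; no further shuffle is triggered
val grouped = pairs.mapPartitions(
  iter => iter.toSeq.groupBy(_._1).iterator,
  preservesPartitioning = true)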
Hi all,
I have this project:
https://github.com/oleber/aws-stepfunctions
I have a second project that should import the first one. In the second
project I did something like:
lazy val awsStepFunctions = RootProject(uri("git://
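A sketch of the usual pattern, using the repository URL from above (the project and module names, and the exact git URL, are my assumption):

// build.sbt of the second project; URL inferred from the project link above
lazy val awsStepFunctions = RootProject(uri("git://github.com/oleber/aws-stepfunctions.git"))

lazy val root = (project in file("."))
  .dependsOn(awsStepFunctions)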
Avro sink --> Spark Streaming
2017-01-16 13:55 GMT+01:00 ayan guha :
> With Flume, what would be your sink?
>
>
>
> On Mon, Jan 16, 2017 at 10:44 PM, Guillermo Ortiz
> wrote:
>
>> I'm wondering to use Flume (channel file)-Spark Streaming.
>>
>> I have
With Flume, what would be your sink?
On Mon, Jan 16, 2017 at 10:44 PM, Guillermo Ortiz
wrote:
> I'm wondering to use Flume (channel file)-Spark Streaming.
>
> I have some doubts about it:
>
> 1.The RDD size is all data what it comes in a microbatch which you have
>
I'm wondering about using Flume (file channel) with Spark Streaming.
I have some doubts about it:
1. The RDD size is all the data that comes in during the microbatch you have
defined. Right?
2. If there are 2 GB of data, how many RDDs are generated? Just one, and I
have to do a repartition?
3. When is the ACK
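For reference, a sketch of the polling receiver setup (host, port, and batch interval are placeholders); each batch interval produces one RDD containing whatever arrived in that interval:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

val conf = new SparkConf().setAppName("FlumeExample")
val ssc = new StreamingContext(conf, Seconds(10))   // placeholder batch interval

// one DStream of SparkFlumeEvent; repartition afterwards if a single batch is large
val stream = FlumeUtils.createPollingStream(ssc, "flume-agent-host", 9988)
stream.map(event => new String(event.event.getBody.array())).print()

ssc.start()
ssc.awaitTermination()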
The JIRA for this is here: https://issues.apache.org/jira/browse/SPARK-15784
There is a PR open already for it, which still needs to be reviewed.
On Wed, 21 Dec 2016 at 18:01 Robert Hamilton
wrote:
> Thank you Nick that is good to know.
>
> Would this have some
Hi Ayan,
Although my first reaction was "Why would anyone ever want to download older
versions?", after thinking about it briefly I concluded that there might be
merit in having it.
Could you please file an issue in the issue tracker ->
https://issues.apache.org/jira/browse/SPARK?
Best regards,
Hi,
I use Spark 2.0.2 and want to do the following:
I extract features in a streaming job and then apply the records to a
k-means model. Some of the features are simple ones which are calculated
directly from the record. But I also have more complex features which
depend on records from a
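If the simple features alone are enough to start with, a sketch of the streaming k-means API (the k, the dimensions, and the feature source are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.Vectors
import scala.collection.mutable

val conf = new SparkConf().setAppName("StreamingKMeansSketch").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(10))

// stand-in for the real feature-extraction step producing a DStream of Vectors
val queue = mutable.Queue(ssc.sparkContext.parallelize(Seq(Vectors.dense(1.0, 2.0))))
val featureStream = ssc.queueStream(queue)

val model = new StreamingKMeans()
  .setK(2)                      // placeholder number of clusters
  .setDecayFactor(1.0)
  .setRandomCenters(2, 0.0)     // dimension must match the feature vectors

model.trainOn(featureStream)
model.predictOn(featureStream).print()

ssc.start()
ssc.awaitTermination()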
Hi,
My team just did an archive of last year's Parquet files. I wonder whether the
filter pushdown optimization still works when I read data through
"har:///path/to/data/"? Thanks.
Best,