Hi Chanh,
I found a workaround that works for me:
http://stackoverflow.com/questions/29552799/spark-unable-to-find-jdbc-driver/40114125#40114125
Regards,
Daniel
On Thu, Oct 6, 2016 at 6:26, Chanh Le () wrote:
> Hi everyone,
> I have the same config in both modes and I really want to change
It's a bit less concise but this works:
> a <- as.DataFrame(cars)
> head(a)
speed dist
1 4 2
2 4 10
3 7 4
4 7 22
5 8 16
6 9 10
> b <- withColumn(a, "speed", ifelse(a$speed > 15, a$speed, 3))
> head(b)
speed dist
1 3 2
2 3 10
3 3 4
4 3 22
5 3 16
6 3 10
I think your example could be something
Thank you Daniel,
Actually, I tried this before, but this approach is still not flexible if you are
running multiple jobs at the same time that may have different dependencies for each
job configuration, so I gave up.
Another simple solution is to set up the command below as a service, which is what I am
using.
> /bu
While using deep learning you might want to stay as close to TensorFlow as
possible. There is very little translation loss, you get access to stable,
scalable, and tested libraries from the best brains in the industry, and as
far as Scala goes, it helps a lot to think about using the language as a
tool.
On 18 Oct 2016, at 10:58, Chetan Khatri <ckhatriman...@gmail.com> wrote:
Dear Xi shen,
Thank you for getting back to question.
The approach I am following is as below:
I have MSSQL Server as the enterprise data lake.
1. Run Java jobs and generate JSON files; every file is almost 6 GB.
Co
On 19 Oct 2016, at 00:18, Michael Segel <msegel_had...@hotmail.com> wrote:
(Sorry, sent the reply via the wrong account.)
Steve,
Kinda hijacking the thread, but I promise it's still on topic to the OP's issue.. ;-)
Usually you will end up having a local Kerberos set up per cluster.
So your machine
Is there any way to see Spark class variable values in the Variable Explorer of
Spyder for Python?
Hi!
I am working with the new Spark 2 Datasets API. PR:
https://github.com/geotrellis/geotrellis/pull/1675
The idea is to use Datasets[(K, V)] and for example to join by Key of type
K.
The first problem was that there are no Encoders for custom types (not
products), so the workaround was to use Kryo:
h
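For reference, a minimal sketch of that Kryo workaround; the CustomKey type below is hypothetical and only stands in for a non-product key type:

import org.apache.spark.sql.{Encoder, Encoders, SparkSession}

object KryoEncoderSketch {
  // Hypothetical non-product key type that has no built-in Encoder.
  class CustomKey(val col: Int, val row: Int) extends Serializable

  // Workaround: fall back to Kryo serialization for the custom type.
  implicit val keyEncoder: Encoder[CustomKey] = Encoders.kryo[CustomKey]
  implicit val pairEncoder: Encoder[(CustomKey, Int)] =
    Encoders.tuple(keyEncoder, Encoders.scalaInt)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kryo-encoder-sketch").getOrCreate()
    val ds = spark.createDataset(Seq(new CustomKey(0, 0) -> 1, new CustomKey(0, 1) -> 2))
    ds.show()  // under Kryo the key is stored as an opaque binary column
    spark.stop()
  }
}

The trade-off is that Kryo-encoded columns are opaque binary, so Catalyst cannot prune or push filters into them; joining by key still works through joinWith, just without column-level optimization.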
or alternatively this should work (assuming parsedData is an RDD[Vector]):
clusters.predict(parsedData)
> On 18 Oct 2016, at 00:35, Reth RM wrote:
>
> I think I got it
>
> parsedData.foreach(
> new VoidFunction() {
> @Override
> p
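For reference, a minimal sketch of the RDD-level call; the input path, k, and iteration count below are purely illustrative, and parsedData is assumed to be an RDD of MLlib vectors:

import org.apache.spark.SparkContext
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val sc = SparkContext.getOrCreate()

// Hypothetical input: whitespace-separated feature values, one point per line.
val parsedData = sc.textFile("/data/kmeans_points.txt")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .cache()

val clusters = KMeans.train(parsedData, 3, 20)   // k = 3, maxIterations = 20

// Predict the whole RDD in one call instead of element by element in a foreach.
val assignments = clusters.predict(parsedData)   // RDD[Int] of cluster ids
assignments.take(5).foreach(println)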
Hi Frank,
Two suggestions:
1. I would recommend caching the corpus prior to running LDA.
2. I would tweak the sample size per iteration using the setMiniBatchFraction
parameter (note this applies to the online optimizer rather than EM); a sketch of
both suggestions follows below.
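A minimal sketch of both suggestions, assuming corpus is an RDD[(Long, Vector)] of (document id, term-count vector); the k, iteration, and fraction values are purely illustrative:

import org.apache.spark.mllib.clustering.{LDA, OnlineLDAOptimizer}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

def trainLda(corpus: RDD[(Long, Vector)]) = {
  // 1. Cache the corpus so each LDA iteration does not recompute it.
  corpus.cache()

  // 2. Sample a smaller fraction of documents per iteration;
  //    miniBatchFraction is a knob on the online optimizer.
  val optimizer = new OnlineLDAOptimizer().setMiniBatchFraction(0.05)

  new LDA()
    .setK(20)
    .setMaxIterations(50)
    .setOptimizer(optimizer)
    .run(corpus)
}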
-Richard
On Tue, Sep 20, 2016 at 10:27 AM, Frank Zhang <
dataminin...@yahoo
On that note, here is an article that Databricks made regarding using
Tensorflow in conjunction with Spark.
https://databricks.com/blog/2016/01/25/deep-learning-with-apache-spark-and-tensorflow.html
Cheers,
Ben
> On Oct 19, 2016, at 3:09 AM, Gourav Sengupta
> wrote:
>
> while using Deep Lea
Agreed. But as it states, deeper integration with Scala is yet to be
developed.
Any thoughts on how to use TensorFlow with Scala? We would need to write wrappers, I
think.
On Oct 19, 2016 7:56 AM, "Benjamin Kim" wrote:
> On that note, here is an article that Databricks made regarding using
> Tensorflow
See https://issues.apache.org/jira/browse/SPARK-13258 for an explanation
and workaround.
On Wed, Oct 19, 2016 at 1:35 AM, Chanh Le wrote:
> Thank you Daniel,
> Actually, I tried this before, but this approach is still not flexible if you
> are running multiple jobs at the same time that may have different dep
Bringing this thread back as I'm seeing this exception on a production
kafka cluster.
I have two Spark streaming apps reading the same topic. App1 has a batch
interval of 2 seconds and app2 has 60 seconds.
Both apps are running on the same cluster on similar hardware. I see this
exception only in app2 and fair
60 seconds for a batch is above the default settings in Kafka related
to heartbeat timeouts, so that might be related. Have you tried
tweaking session.timeout.ms, heartbeat.interval.ms, or related
configs?
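As a rough illustration of where those settings go for the direct-stream consumer, with every value below illustrative rather than a recommendation (heartbeat.interval.ms should stay well below session.timeout.ms, and request.timeout.ms above it):

import org.apache.kafka.common.serialization.StringDeserializer

val kafkaParams = Map[String, Object](
  "bootstrap.servers"     -> "broker1:9092",           // hypothetical broker
  "group.id"              -> "app2-consumer-group",    // hypothetical group id
  "key.deserializer"      -> classOf[StringDeserializer],
  "value.deserializer"    -> classOf[StringDeserializer],
  // Give the slow 60-second-batch app more slack before the broker evicts it.
  "session.timeout.ms"    -> "90000",
  "heartbeat.interval.ms" -> "30000",
  "request.timeout.ms"    -> "95000"                   // must exceed session.timeout.ms
)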
On Wed, Oct 19, 2016 at 12:22 PM, Srikanth wrote:
> Bringing this thread back as I'm seein
Dear Apache Enthusiast,
ApacheCon Sevilla is now less than a month out, and we need your help
getting the word out. Please tell your colleagues, your friends, and
members of related technical communities, about this event. Rates go up
November 3rd, so register today!
ApacheCon, and Apache Big Dat
Can someone please shed some light on this? I wrote the code below in
Scala 2.10.5; can someone please tell me if this is the right way of doing
it?
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.functions._
import org.a
Another reason I could imagine is that files are often read from HDFS,
which by default uses line terminators to separate records.
It is possible to implement your own HDFS delimiter finder; however,
for arbitrary JSON data, finding that delimiter would require stateful
parsing of the file and woul
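To illustrate the practical consequence (a sketch; both paths are hypothetical): Spark's JSON source assumes one record per line, so multi-line JSON has to be read whole-file and handed to the parser explicitly.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("json-delimiter-sketch").getOrCreate()

// Works out of the box: one JSON object per line, so HDFS can split records on '\n'.
val lineDelimited = spark.read.json("/data/events_line_delimited")

// Multi-line or pretty-printed JSON: input splits cannot see object boundaries,
// so read each file whole and hand its contents to the JSON reader yourself.
val wholeFiles = spark.sparkContext.wholeTextFiles("/data/events_pretty")
val parsed = spark.read.json(wholeFiles.values)   // one JSON document per file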
in my case, my model size is fairly small ( 100k training samples ), though
the features count is roughly 100k populated out of 10mil possible features.
in this case it does not help me to distribute the training process, since
data size is so small. I just need a good core solver to train the mod
Yeah, the design is mainly because of HDFS.
--
------------------ Original message ------------------
From: "Jakob Odersky"
Date: Oct 20, 2016, 4:46
To: "Hyukjin Kwon"
Hello there,
I am trying to understand how and when DataFrame (or Dataset) sets
nullable = true vs. false on a schema.
Here is my observation from a sample code I tried...
scala> spark.createDataset(Seq((1, "a", 2.0d), (2, "b", 2.0d), (3, "c",
2.0d))).toDF("col1", "col2", "col3").withColumn
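For reference, a minimal sketch of how the flags typically come out in a spark-shell session: primitive Scala columns such as Int and Double are reported as nullable = false, while reference types like String come out as nullable = true.

val df = spark.createDataset(Seq((1, "a", 2.0d), (2, "b", 2.0d), (3, "c", 2.0d)))
  .toDF("col1", "col2", "col3")

df.printSchema()
// root
//  |-- col1: integer (nullable = false)   <- Scala Int can never hold null
//  |-- col2: string (nullable = true)     <- String is a reference type
//  |-- col3: double (nullable = false)    <- Scala Double can never hold null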
Nullable is just a hint to the optimizer that it's impossible for there to
be a null value in this column, so that it can avoid generating code for
null-checks. When in doubt, we set nullable=true since it is always safer
to check.
Why in particular are you trying to change the nullability of the
In Spark 2.0 we bin-pack small files into a single task to avoid
overloading the scheduler. If you want a specific number of partitions you
should repartition. If you want to disable this optimization you can set
the file open cost very high: spark.sql.files.openCostInBytes
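A rough sketch of both options; the path, partition count, and cost value below are illustrative:

// Option 1: ask for an explicit number of partitions after the read.
val df = spark.read.parquet("/data/many_small_files")
val repartitioned = df.repartition(200)

// Option 2: make every file look very expensive to open, which effectively
// disables the bin-packing so each small file gets its own partition.
spark.conf.set("spark.sql.files.openCostInBytes", (50L * 1024 * 1024 * 1024).toString)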
On Tue, Oct 18, 2016
Hello Michael,
Thank you for looking into this query. In my case there seems to be an issue
when I union a parquet file read from disk versus another dataframe that I
construct in-memory. The only difference I see is the containsNull = true.
In fact, I do not see any errors with union on the simple
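If the only mismatch really is the nullable/containsNull flags, one way to compare and align the two sides before the union is to rebuild one DataFrame against the other's schema; a sketch, with the DataFrame names hypothetical:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.StructType

// Spot the differing flags field by field:
//   parquetDf.schema.fields.zip(inMemoryDf.schema.fields).foreach(println)

// Rebuild the in-memory side with the parquet schema so the flags match.
def withSchema(df: DataFrame, target: StructType): DataFrame =
  df.sparkSession.createDataFrame(df.rdd, target)

// val unioned = withSchema(inMemoryDf, parquetDf.schema).union(parquetDf)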
Hi Group
Sorry to rekindle this thread.
Using Spark 1.6.0 on CDH 5.7.
Any idea?
Best
Ayan
On Fri, Oct 7, 2016 at 5:08 PM, Mich Talebzadeh
wrote:
> Hi Ayan,
>
> Depends on the version of Spark you are using.
>
> Have you tried updating stats in Hive?
>
> ANALYZE TABLE ${DATABASE}.${TABLE} PA
Hi,
I have a case where I use partitionBy to write my DF using a calculated
column, so it looks something like this:
val df = spark.sql("select *, from_unixtime(ts, 'MMddH')
partition_key from mytable")
df.write.partitionBy("partition_key").orc("/partitioned_table")
df is 8 partitions in s
Hi all,
My issue is that every day I will receive some JSON data files, and I want to convert
them to Parquet files and save them to HDFS.
The folder will look like this:
/my_table_base_floder
/my_table_base_floder/day_2
/my_table_base_floder/day_3
where the parquet files o
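A minimal sketch of one way to do the daily conversion; the landing directory and the day value are hypothetical:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("daily-json-to-parquet").getOrCreate()

val day = "day_4"                                     // e.g. derived from the current date
val json = spark.read.json(s"/incoming_json/$day")    // hypothetical landing directory

json.write
  .mode("overwrite")
  .parquet(s"/my_table_base_floder/$day")             // keeps the per-day folder layout above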