I am reading JSON data that has different schemas for every record. That
is, for a given field that would have a null value, the field is simply absent
from that record (and therefore from its schema).
I would like to use the DataFrame API to select specific fields from this
data, and for fields that are missing
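
(A minimal sketch of one way to handle the missing fields, assuming the field
names and types are known up front; supplying an explicit schema makes absent
fields come back as null rather than breaking the select. Field names and the
path below are placeholders.)

import org.apache.spark.sql.types.{StructType, StructField, StringType, LongType}

// Declare every field you might want, even ones absent from some records.
val schema = StructType(Seq(
  StructField("field1", StringType, nullable = true),
  StructField("field2", LongType, nullable = true)))

// Records missing a declared field get null for it instead of dropping the column.
val df = sqlContext.read.schema(schema).json("/path/to/json")
df.select("field1", "field2").show()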
When I call rdd() on a DataFrame, it ends the current stage and starts a
new one that just maps the DataFrame to rdd and nothing else. It doesn't
seem to do a shuffle (which is good and expected), so why is there a separate
stage?
I also thought that stages only end when there's a shuffle.
Spark SQL has a "first" function that returns the first item in a group. Is
there a similar function, perhaps in a third party lib, that allows you to
return an arbitrary (e.g. 3rd) item from the group? I was thinking of writing
a UDAF for it, but didn't want to reinvent the wheel. My end goal is to b
but writing a UDF is much simpler
than a UDAF.
On Tue, Jul 26, 2016 at 11:48 AM, ayan guha wrote:
> You can use rank with a window function. Rank=1 is the same as calling first().
>
> Not sure how you would randomly pick records though, if there is no Nth
> record. In your example, what
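
(A sketch of the window-function approach described above, assuming there is
some ordering column that defines which record counts as the Nth; the column
names here are made up.)

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Number the rows within each group by some explicit ordering...
val w = Window.partitionBy("groupCol").orderBy("orderCol")

// ...then keep only the 3rd row of each group. Groups with fewer than 3 rows
// simply produce nothing, which is the "no Nth record" case mentioned above.
val third = df.withColumn("rn", row_number().over(w))
  .filter(col("rn") === 3)
  .drop("rn")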
Ran into this need myself. Does Spark have an equivalent of "mapreduce.
input.fileinputformat.list-status.num-threads"?
Thanks.
On Thu, Jul 23, 2015 at 8:50 PM, Cheolsoo Park wrote:
> Hi,
>
> I am wondering if anyone has successfully enabled
> "mapreduce.input.fileinputformat.list-status.num-t
https://issues.apache.org/jira/browse/SPARK-9926
>
> On Tue, Jan 12, 2016 at 10:55 AM, Alex Nastetsky <
> alex.nastet...@vervemobile.com> wrote:
>
>> Ran into this need myself. Does Spark have an equivalent of "mapreduce.
>> input.fileinputformat.list-status.num-threads"?
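
(For what it's worth, the Hadoop property itself can be set on the
SparkContext's Hadoop configuration; whether a given input listing actually
honors it is what the JIRA above is about. The Spark SQL setting below is a
separate, assumed-relevant knob for DataFrame sources.)

// Hadoop-level listing threads for FileInputFormat-based inputs.
sc.hadoopConfiguration.set(
  "mapreduce.input.fileinputformat.list-status.num-threads", "20")

// DataFrame / Spark SQL sources use their own partition-discovery settings.
sqlContext.setConf("spark.sql.sources.parallelPartitionDiscovery.threshold", "32")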
As a user of AWS EMR (running Spark and MapReduce), I am interested in
potential benefits that I may gain from Databricks Cloud. I was wondering
if anyone has used both and done comparison / contrast between the two
services.
In general, which resource manager(s) does Databricks Cloud use for Spark?
I save my dataframe to avro with spark-avro 1.0.0 and it looks like this
(using avro-tools tojson):
{"field1":"value1","field2":976200}
{"field1":"value2","field2":976200}
{"field1":"value3","field2":614100}
But when I use spark-avro 2.0.1, it looks like this:
{"field1":{"string":"value1"},"fiel
Here you go: https://github.com/databricks/spark-avro/issues/92
Thanks.
On Wed, Oct 14, 2015 at 4:41 PM, Josh Rosen wrote:
> Can you report this as an issue at
> https://github.com/databricks/spark-avro/issues so that it's easier to
> track? Thanks!
>
> On Wed, Oct 14, 2
A lot of RDD methods take a numPartitions parameter that lets you specify
the number of partitions in the result. For example, groupByKey.
The DataFrame counterparts don't have a numPartitions parameter, e.g.
groupBy only takes a bunch of Columns as params.
I understand that the DataFrame API is
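
(As far as I can tell, the DataFrame-side equivalent is the shuffle-partitions
setting, or an explicit repartition, rather than a per-operator numPartitions;
a sketch:)

// Number of post-shuffle partitions used by DataFrame aggregations/joins (default 200).
sqlContext.setConf("spark.sql.shuffle.partitions", "64")
val counts = df.groupBy("key").count()

// Or reshape explicitly after (or before) the aggregation.
val counts8 = df.groupBy("key").count().repartition(8)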
Using Spark 1.5.1, Parquet 1.7.0.
I'm trying to write Avro/Parquet files. I have this code:
sc.hadoopConfiguration.set(ParquetOutputFormat.WRITE_SUPPORT_CLASS,
classOf[AvroWriteSupport].getName)
AvroWriteSupport.setSchema(sc.hadoopConfiguration, MyClass.SCHEMA$)
myDF.write.parquet(outputPath)
Th
Figured it out ... needed to use saveAsNewAPIHadoopFile, but was trying to
use it on myDF.rdd instead of converting it to a PairRDD first.
On Mon, Oct 19, 2015 at 2:14 PM, Alex Nastetsky <
alex.nastet...@vervemobile.com> wrote:
> Using Spark 1.5.1, Parquet 1.7.0.
>
> I'm
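
(A rough sketch of what that looks like, for anyone hitting the same thing;
rowToMyClass below is a hypothetical Row-to-Avro-record converter, not
something provided by the libraries.)

import org.apache.hadoop.mapreduce.Job
import org.apache.parquet.avro.AvroParquetOutputFormat

val job = Job.getInstance(sc.hadoopConfiguration)
AvroParquetOutputFormat.setSchema(job, MyClass.SCHEMA$)

myDF.rdd
  .map(row => (null.asInstanceOf[Void], rowToMyClass(row))) // rowToMyClass: hypothetical Row => MyClass
  .saveAsNewAPIHadoopFile(
    outputPath,
    classOf[Void],
    classOf[MyClass],
    classOf[AvroParquetOutputFormat],
    job.getConfiguration)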
I'm just trying to do some operation inside foreachPartition, but I can't
even get a simple println to work. Nothing gets printed.
scala> val a = sc.parallelize(List(1,2,3))
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize
at <console>:21
scala> a.foreachPartition(p => println("f
Ahh, makes sense. Knew it was going to be something simple. Thanks.
On Fri, Oct 30, 2015 at 7:45 PM, Mark Hamstra
wrote:
> The closure is sent to and executed on an Executor, so you need to be looking
> at the stdout of the Executors, not on the Driver.
>
> On Fri, Oct 30, 2015 at 4
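
(A small illustration of the difference, in case it helps the next person: the
first call prints on the executors, the second actually shows up in the driver
console.)

val nums = sc.parallelize(List(1, 2, 3))

// Runs on the executors; output lands in each executor's stdout, not here.
nums.foreachPartition(part => part.foreach(x => println(s"executor saw $x")))

// Bring the data back to the driver first if you want to see it locally.
nums.mapPartitions(part => Iterator(part.toList)).collect().foreach(println)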
Does Spark have an implementation similar to CompositeInputFormat in
MapReduce?
CompositeInputFormat joins multiple datasets prior to the mapper, that are
partitioned the same way with the same number of partitions, using the
"part" number in the file name in each dataset to figure out which file
Hi,
I'm trying to understand SortMergeJoin (SPARK-2213).
1) Once SortMergeJoin is enabled, will it ever use ShuffledHashJoin? For
example, in the code below, the two datasets have different numbers of
partitions, but it still does a SortMerge join after a "hashpartitioning".
CODE:
val sparkCo
join keys will be loaded by the
> same node/task, since lots of factors need to be considered, like task
> pool size, cluster size, source format, storage, data locality etc.,.
>
> I’ll agree it’s worth optimizing it for performance concerns, and
> actually in Hive, it is calle
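
(When in doubt about which join the planner picked, the physical plan is the
most direct evidence; a sketch, with the config names as I understand them for
1.5/1.6.)

// Sort-merge join switch in 1.5/1.6.
sqlContext.setConf("spark.sql.planner.sortMergeJoin", "true")
// Rule out broadcast joins for the experiment.
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", "-1")

val joined = df1.join(df2, df1("key") === df2("key"))
joined.explain() // look for SortMergeJoin vs ShuffledHashJoin in the physical plan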
Hi,
I believe I ran into the same bug in 1.5.0, although my error looks like
this:
Caused by: java.lang.ClassCastException:
[Lcom.verve.spark.sql.ElementWithCount; cannot be cast to
org.apache.spark.sql.types.ArrayData
at
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getA
Hi,
I am using Spark 1.5.1.
I have a Spark SQL UDAF that works fine on a tiny dataset (13 narrow rows)
in local mode, but runs out of memory on YARN about half the time
(OutOfMemoryError: Java heap space). The rest of the time, it works on YARN.
Note that in all instances, the input data is the same.
Is it possible to restart the job from the last successful stage instead of
from the beginning?
For example, if your job has stages 0, 1 and 2 .. and stage 0 takes a long
time and is successful, but the job fails on stage 1, it would be useful to
be able to restart from the output of stage 0 instead of from the beginning.
to core though. For dataframes, depending on what you do,
>>> the physical plan may get generated again leading to new RDDs which may
>>> cause recomputing all the stages. Consider running the job by generating
>>> the RDD from Dataframe and then using that.
>>>
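
(A sketch of the workaround implied above: materialize the expensive
intermediate result yourself, so a rerun can start from it rather than from the
beginning. Paths and function names below are placeholders.)

// First run: pay for the expensive "stage 0" work once and persist it durably.
val stage0 = expensiveTransform(inputDF)          // hypothetical long-running step
stage0.write.parquet("/checkpoints/stage0")

// A rerun can pick up from the saved output instead of recomputing it.
val stage0Reloaded = sqlContext.read.parquet("/checkpoints/stage0")
val result = restOfPipeline(stage0Reloaded)       // hypothetical downstream stages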
The doc for the DataFrameReader#json(RDD[String]) method says
"Unless the schema is specified using schema function, this function goes
through the input once to determine the input schema."
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameReader
Why is this necessary?
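
(For reference, the extra pass is only for inferring the schema; passing one in
explicitly avoids it. A short sketch with a made-up schema:)

import org.apache.spark.sql.types.{StructType, StructField, StringType}

// With an explicit schema there is no inference scan; the data is read once.
val schema = StructType(Seq(StructField("field1", StringType)))
val df = sqlContext.read.schema(schema).json(jsonRdd)   // jsonRdd: RDD[String]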
I am finding the Dataset API very cumbersome to use, which is unfortunate, as I
was looking forward to the type safety after coming from a DataFrame codebase.
This link summarizes my troubles:
http://loicdescotte.github.io/posts/spark2-datasets-type-safety/
The problem is having to c
Is there a way to make outputs created with "partitionBy" to contain the
partitioned column? When reading the output with Spark or Hive or similar,
it's less of an issue because those tools know how to perform partition
discovery. But if I were to load the output into an external data warehouse
or
.
On Mon, Feb 26, 2018 at 5:47 PM, naresh Goud
wrote:
> does this help?
>
> sc.parallelize(List((1,10),(2,20))).toDF("foo","bar").map(("
> foo","bar")=>("foo",("foo","bar"))).partitionBy("foo").json(&qu
I don't know if this is a bug or a feature, but it's a bit counter-intuitive
when reading code.
The "b" dataframe does not have field "bar" in its schema, but is still able to
filter on that field.
scala> val a = sc.parallelize(Seq((1,10),(2,20))).toDF("foo","bar")
a: org.apache.spark.sql.DataFrame = [foo: int, bar: int]
Hi all,
I create a dataframe, convert it to Avro with to_avro and write it to
Kafka.
Then I read it back out with from_avro.
(Not using Schema Registry.)
The problem is that the values skip every other field in the result.
I expect:
+---------+--------+-----+
|firstName|lastName|color|
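
(For context, a minimal round-trip sketch of the kind of pipeline described,
without Kafka or a Schema Registry in the middle; a reader schema that doesn't
exactly match what to_avro produced, particularly nullable fields becoming Avro
unions, is a plausible way for fields to end up misaligned. Spark 2.4-style
import shown; in 3.x the functions live under org.apache.spark.sql.avro.functions.)

// Assumes spark-shell (implicits available for toDF).
import org.apache.spark.sql.avro._
import org.apache.spark.sql.functions.{col, struct}

val people = Seq(("Jane", "Doe", "red")).toDF("firstName", "lastName", "color")

// Pack each row into a single Avro binary column (what would be the Kafka value).
val asAvro = people.select(to_avro(struct(people.columns.map(col): _*)).as("value"))

// Derive the reader schema from the same struct type rather than hand-writing it,
// so it matches the writer schema exactly.
val readerSchema = SchemaConverters.toAvroType(people.schema).toString

asAvro.select(from_avro(col("value"), readerSchema).as("rec")).select("rec.*").show()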