I am reading JSON data that has different schemas for every record. That
is, for a given field that would have a null value, the field is simply absent
from that record (and therefore from its schema).
I would like to use the DataFrame API to select specific fields from this
data, and for fields that are missing
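
(A minimal sketch of one way to handle the missing fields, assuming the field
names and types are known up front; supplying an explicit schema makes absent
fields come back as null rather than breaking the select. Field names and the
path below are placeholders.)

import org.apache.spark.sql.types.{StructType, StructField, StringType, LongType}

// Declare every field you might want, even ones absent from some records.
val schema = StructType(Seq(
  StructField("field1", StringType, nullable = true),
  StructField("field2", LongType, nullable = true)))

// Records missing a declared field get null for it instead of dropping the column.
val df = sqlContext.read.schema(schema).json("/path/to/json")
df.select("field1", "field2").show()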
When I call rdd() on a DataFrame, it ends the current stage and starts a
new one that just maps the DataFrame to rdd and nothing else. It doesn't
seem to do a shuffle (which is good and expected), so why is there a separate
stage?
I also thought that stages only end when there's a shuffle.
Spark SQL has a "first" function that returns the first item in a group. Is
there a similar function, perhaps in a third party lib, that allows you to
return an arbitrary (e.g. 3rd) item from the group? I was thinking of writing
a UDAF for it, but didn't want to reinvent the wheel. My end goal is to b
but writing a UDF is much simpler
than a UDAF.
On Tue, Jul 26, 2016 at 11:48 AM, ayan guha wrote:
> You can use rank with a window function. Rank=1 is the same as calling first().
>
> Not sure how you would randomly pick records though, if there is no Nth
> record. In your example, what
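
(A sketch of the window-function approach described above, assuming there is
some ordering column that defines which record counts as the Nth; the column
names here are made up.)

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Number the rows within each group by some explicit ordering...
val w = Window.partitionBy("groupCol").orderBy("orderCol")

// ...then keep only the 3rd row of each group. Groups with fewer than 3 rows
// simply produce nothing, which is the "no Nth record" case mentioned above.
val third = df.withColumn("rn", row_number().over(w))
  .filter(col("rn") === 3)
  .drop("rn")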
Ran into this need myself. Does Spark have an equivalent of "mapreduce.
input.fileinputformat.list-status.num-threads"?
Thanks.
On Thu, Jul 23, 2015 at 8:50 PM, Cheolsoo Park wrote:
> Hi,
>
> I am wondering if anyone has successfully enabled
> "mapreduce.input.fileinputformat.list-status.num-t
https://issues.apache.org/jira/browse/SPARK-9926
>
> On Tue, Jan 12, 2016 at 10:55 AM, Alex Nastetsky <
> alex.nastet...@vervemobile.com> wrote:
>
>> Ran into this need myself. Does Spark have an equivalent of "mapreduce.
>> input.fileinputformat.list-status.num-threads"?
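
(For what it's worth, the Hadoop property itself can be set on the
SparkContext's Hadoop configuration; whether a given input listing actually
honors it is what the JIRA above is about. The Spark SQL setting below is a
separate, assumed-relevant knob for DataFrame sources.)

// Hadoop-level listing threads for FileInputFormat-based inputs.
sc.hadoopConfiguration.set(
  "mapreduce.input.fileinputformat.list-status.num-threads", "20")

// DataFrame / Spark SQL sources use their own partition-discovery settings.
sqlContext.setConf("spark.sql.sources.parallelPartitionDiscovery.threshold", "32")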
As a user of AWS EMR (running Spark and MapReduce), I am interested in
potential benefits that I may gain from Databricks Cloud. I was wondering
if anyone has used both and done comparison / contrast between the two
services.
In general, which resource manager(s) does Databricks Cloud use for Spark?
I save my dataframe to avro with spark-avro 1.0.0 and it looks like this
(using avro-tools tojson):
{"field1":"value1","field2":976200}
{"field1":"value2","field2":976200}
{"field1":"value3","field2":614100}
But when I use spark-avro 2.0.1, it looks like this:
{"field1":{"string":"value1"},"fiel
Here you go: https://github.com/databricks/spark-avro/issues/92
Thanks.
On Wed, Oct 14, 2015 at 4:41 PM, Josh Rosen wrote:
> Can you report this as an issue at
> https://github.com/databricks/spark-avro/issues so that it's easier to
> track? Thanks!
>
> On Wed, Oct 14, 2
A lot of RDD methods take a numPartitions parameter that lets you specify
the number of partitions in the result. For example, groupByKey.
The DataFrame counterparts don't have a numPartitions parameter, e.g.
groupBy only takes a bunch of Columns as params.
I understand that the DataFrame API is
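
(As far as I can tell, the DataFrame-side equivalent is the shuffle-partitions
setting, or an explicit repartition, rather than a per-operator numPartitions;
a sketch:)

// Number of post-shuffle partitions used by DataFrame aggregations/joins (default 200).
sqlContext.setConf("spark.sql.shuffle.partitions", "64")
val counts = df.groupBy("key").count()

// Or reshape explicitly after (or before) the aggregation.
val counts8 = df.groupBy("key").count().repartition(8)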
Using Spark 1.5.1, Parquet 1.7.0.
I'm trying to write Avro/Parquet files. I have this code:
sc.hadoopConfiguration.set(ParquetOutputFormat.WRITE_SUPPORT_CLASS,
classOf[AvroWriteSupport].getName)
AvroWriteSupport.setSchema(sc.hadoopConfiguration, MyClass.SCHEMA$)
myDF.write.parquet(outputPath)
Th
Figured it out ... needed to use saveAsNewAPIHadoopFile, but was trying to
use it on myDF.rdd instead of converting it to a PairRDD first.
On Mon, Oct 19, 2015 at 2:14 PM, Alex Nastetsky <
alex.nastet...@vervemobile.com> wrote:
> Using Spark 1.5.1, Parquet 1.7.0.
>
> I'm
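
(A rough sketch of what that looks like, for anyone hitting the same thing;
rowToMyClass below is a hypothetical Row-to-Avro-record converter, not
something provided by the libraries.)

import org.apache.hadoop.mapreduce.Job
import org.apache.parquet.avro.AvroParquetOutputFormat

val job = Job.getInstance(sc.hadoopConfiguration)
AvroParquetOutputFormat.setSchema(job, MyClass.SCHEMA$)

myDF.rdd
  .map(row => (null.asInstanceOf[Void], rowToMyClass(row))) // rowToMyClass: hypothetical Row => MyClass
  .saveAsNewAPIHadoopFile(
    outputPath,
    classOf[Void],
    classOf[MyClass],
    classOf[AvroParquetOutputFormat],
    job.getConfiguration)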
I'm just trying to do some operation inside foreachPartition, but I can't
even get a simple println to work. Nothing gets printed.
scala> val a = sc.parallelize(List(1,2,3))
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize
at <console>:21
scala> a.foreachPartition(p => println("f
Ahh, makes sense. Knew it was going to be something simple. Thanks.
On Fri, Oct 30, 2015 at 7:45 PM, Mark Hamstra
wrote:
> The closure is sent to and executed on an Executor, so you need to be looking
> at the stdout of the Executors, not on the Driver.
>
> On Fri, Oct 30, 2015 at 4
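
(A small illustration of the difference, in case it helps the next person: the
first call prints on the executors, the second actually shows up in the driver
console.)

val nums = sc.parallelize(List(1, 2, 3))

// Runs on the executors; output lands in each executor's stdout, not here.
nums.foreachPartition(part => part.foreach(x => println(s"executor saw $x")))

// Bring the data back to the driver first if you want to see it locally.
nums.mapPartitions(part => Iterator(part.toList)).collect().foreach(println)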
Does Spark have an implementation similar to CompositeInputFormat in
MapReduce?
CompositeInputFormat joins multiple datasets prior to the mapper, that are
partitioned the same way with the same number of partitions, using the
"part" number in the file name in each dataset to figure out which file
Hi,
I'm trying to understand SortMergeJoin (SPARK-2213).
1) Once SortMergeJoin is enabled, will it ever use ShuffledHashJoin? For
example, in the code below, the two datasets have different numbers of
partitions, but it still does a SortMerge join after a "hashpartitioning".
CODE:
val sparkCo
join keys will be loaded by the
> same node/task, since lots of factors need to be considered, like task
> pool size, cluster size, source format, storage, data locality etc.,.
>
> I’ll agree it’s worth optimizing it for performance concerns, and
> actually in Hive, it is calle
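
(When in doubt about which join the planner picked, the physical plan is the
most direct evidence; a sketch, with the config names as I understand them for
1.5/1.6.)

// Sort-merge join switch in 1.5/1.6.
sqlContext.setConf("spark.sql.planner.sortMergeJoin", "true")
// Rule out broadcast joins for the experiment.
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", "-1")

val joined = df1.join(df2, df1("key") === df2("key"))
joined.explain() // look for SortMergeJoin vs ShuffledHashJoin in the physical plan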
Hi,
I believe I ran into the same bug in 1.5.0, although my error looks like
this:
Caused by: java.lang.ClassCastException:
[Lcom.verve.spark.sql.ElementWithCount; cannot be cast to
org.apache.spark.sql.types.ArrayData
at
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getA
Hi,
I am using Spark 1.5.1.
I have a Spark SQL UDAF that works fine on a tiny dataset (13 narrow rows)
in local mode, but runs out of memory on YARN about half the time
(OutOfMemoryError: Java heap space). The rest of the time, it works on YARN.
Note that in all instances, the input data is the same.
Is it possible to restart the job from the last successful stage instead of
from the beginning?
For example, if your job has stages 0, 1 and 2 .. and stage 0 takes a long
time and is successful, but the job fails on stage 1, it would be useful to
be able to restart from the output of stage 0 instead of from the beginning.
to core though. For dataframes, depending on what you do,
>>> the physical plan may get generated again leading to new RDDs which may
>>> cause recomputing all the stages. Consider running the job by generating
>>> the RDD from Dataframe and then using that.
>>>
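
(A sketch of the workaround implied above: materialize the expensive
intermediate result yourself, so a rerun can start from it rather than from the
beginning. Paths and function names below are placeholders.)

// First run: pay for the expensive "stage 0" work once and persist it durably.
val stage0 = expensiveTransform(inputDF)          // hypothetical long-running step
stage0.write.parquet("/checkpoints/stage0")

// A rerun can pick up from the saved output instead of recomputing it.
val stage0Reloaded = sqlContext.read.parquet("/checkpoints/stage0")
val result = restOfPipeline(stage0Reloaded)       // hypothetical downstream stages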
The doc for the DataFrameReader#json(RDD[String]) method says
"Unless the schema is specified using schema function, this function goes
through the input once to determine the input schema."
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameReader
Why is this necessary?
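
(For reference, the extra pass is only for inferring the schema; passing one in
explicitly avoids it. A short sketch with a made-up schema:)

import org.apache.spark.sql.types.{StructType, StructField, StringType}

// With an explicit schema there is no inference scan; the data is read once.
val schema = StructType(Seq(StructField("field1", StringType)))
val df = sqlContext.read.schema(schema).json(jsonRdd)   // jsonRdd: RDD[String]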
I am finding the Dataset API very cumbersome to use, which is unfortunate, as I
was looking forward to the type safety after coming from a DataFrame codebase.
This link summarizes my troubles:
http://loicdescotte.github.io/posts/spark2-datasets-type-safety/
The problem is having to c
Is there a way to make outputs created with "partitionBy" to contain the
partitioned column? When reading the output with Spark or Hive or similar,
it's less of an issue because those tools know how to perform partition
discovery. But if I were to load the output into an external data warehouse
or
.
On Mon, Feb 26, 2018 at 5:47 PM, naresh Goud
wrote:
> does this help?
>
> sc.parallelize(List((1,10),(2,20))).toDF("foo","bar").map(("
> foo","bar")=>("foo",("foo","bar"))).partitionBy("foo").json(&qu
I don't know if this is a bug or a feature, but it's a bit counter-intuitive
when reading code.
The "b" dataframe does not have field "bar" in its schema, but is still able to
filter on that field.
scala> val a = sc.parallelize(Seq((1,10),(2,20))).toDF("foo","bar")
a: org.apache.spark.sql.DataFrame = [foo: int, bar: int]
Hi all,
I create a dataframe, convert it to Avro with to_avro and write it to
Kafka.
Then I read it back out with from_avro.
(Not using Schema Registry.)
The problem is that the values skip every other field in the result.
I expect:
+---------+--------+-----+
|firstName|lastName|color|
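
(For context, a minimal round-trip sketch of the kind of pipeline described,
without Kafka or a Schema Registry in the middle; a reader schema that doesn't
exactly match what to_avro produced, particularly nullable fields becoming Avro
unions, is a plausible way for fields to end up misaligned. Spark 2.4-style
import shown; in 3.x the functions live under org.apache.spark.sql.avro.functions.)

// Assumes spark-shell (implicits available for toDF).
import org.apache.spark.sql.avro._
import org.apache.spark.sql.functions.{col, struct}

val people = Seq(("Jane", "Doe", "red")).toDF("firstName", "lastName", "color")

// Pack each row into a single Avro binary column (what would be the Kafka value).
val asAvro = people.select(to_avro(struct(people.columns.map(col): _*)).as("value"))

// Derive the reader schema from the same struct type rather than hand-writing it,
// so it matches the writer schema exactly.
val readerSchema = SchemaConverters.toAvroType(people.schema).toString

asAvro.select(from_avro(col("value"), readerSchema).as("rec")).select("rec.*").show()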