Hi,
I think this is just the overhead of representing nested elements as internal
rows at runtime (e.g., null bits are consumed for each nested element).
Moreover, in the Parquet format, nested data are stored column-wise and
highly compressed, so the on-disk representation is very compact.
That said, I'm not sure about a better approach.
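One common mitigation (not from this thread; the paths and column names below
are hypothetical) is to shuffle only the columns the join actually needs,
keeping the wide nested payload out of the exchange:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// Hypothetical inputs: "wide" carries the nested Maps/Lists.
val wide  = spark.read.parquet("/data/wide")
val other = spark.read.parquet("/data/other")

// Select only the join key and the fields needed downstream before joining,
// so the nested structs never enter the shuffle.
val slim   = wide.select($"id", $"fieldNeededLater")
val joined = slim.join(other, Seq("id"))

If the nested columns are needed afterwards, they can be re-attached with a
second join on the key; whether that wins depends on how much the first
shuffle shrinks.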
The Dataset is defined as a case class with many fields that have nested
structures (Map, List of another case class, etc.).
The size of the Dataset is only 1 TB when saved to disk as a Parquet file,
but when joining it, the shuffle write size becomes as large as 12 TB.
Is there a way to cut it down without
bcc dev@ and add user@
This is more of a user@ list question than a dev@ list question. You
can do something like this:

object MySimpleApp {
  // define some idempotent way to load resources, e.g. with a flag or a
  // lazy val (a lazy val is initialized at most once per JVM, so each
  // executor loads the resources a single time)
  def loadResources(): Unit = ???
  def main(args: Array[String]): Unit = {
    ...
have the first path be something like
.csv("file:///home/user/dataset/data.csv")
(note the three slashes: file://home/... would treat "home" as a host name)
If you're working with files that big:
- don't use the inferSchema option, as that will trigger two scans through
the data; supply an explicit schema instead (see the sketch below)
- try with a smaller file first, say 1 MB or so
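A minimal sketch of the explicit-schema variant, assuming Spark 2.x's
DataFrameReader (the column names are made up for illustration):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("csv-read").getOrCreate()

// Declaring the schema up front avoids the extra pass that inferSchema needs.
val schema = StructType(Seq(
  StructField("id", LongType, nullable = false),
  StructField("name", StringType, nullable = true),
  StructField("value", DoubleType, nullable = true)
))

val df = spark.read
  .schema(schema)
  .option("header", "true")
  .csv("file:///home/user/dataset/data.csv")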
Trying to use Spark *or any other tool* to
import com.datastax.spark.connector._
case class RowObj(page: Int, country: String, id: String, step: Int)
sc.cassandraTable[RowObj]("keyspace", "table")
  .map(row => row.copy(step = 8))
  .saveToCassandra("keyspace", "table", SomeColumns("page", "country", "id", "step"))
Hi guys. I'm pretty new to Spark, and even more so to Scala.
I'm trying to do a
Hi there!
Does Apache Spark offer a callback or some kind of plugin mechanism that
would allow this?
For a bit more context, I am looking to create a Record-Replay environment
using Apache Spark as the core processing engine. The goals are:
1. Trace the origins of every file generated by Spark:
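For what it's worth, one built-in hook that may fit is the SparkListener
callback API; a minimal sketch (the listener class and what it records are
made up for illustration):

import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerTaskEnd}
import org.apache.spark.sql.SparkSession

// Hypothetical listener: a real record-replay tool would persist these
// events durably instead of printing them.
class RecordingListener extends SparkListener {
  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
    println(s"Job ${jobEnd.jobId} finished: ${jobEnd.jobResult}")
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit =
    println(s"Task ${taskEnd.taskInfo.taskId} of stage ${taskEnd.stageId} ended")
}

val spark = SparkSession.builder().appName("record-replay").getOrCreate()
spark.sparkContext.addSparkListener(new RecordingListener)

Listeners can also be registered declaratively through the
spark.extraListeners configuration property.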
Hello All,
Has anyone developed multilabel classification applications with Spark?
I found an example class in the Spark distribution (i.e.,
*JavaMultiLabelClassificationMetricsExample.java*), but it is not a
classifier; it is an evaluator for multilabel classification. Moreover, the
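For reference, a minimal sketch of that evaluator (MLlib's MultilabelMetrics)
on toy data; it scores existing (prediction, label) pairs rather than
training a model:

import org.apache.spark.mllib.evaluation.MultilabelMetrics

// Toy (predictions, labels) pairs; each side is an Array[Double] of class indices.
val scoreAndLabels = sc.parallelize(Seq(
  (Array(0.0, 1.0), Array(0.0, 2.0)),
  (Array(0.0, 2.0), Array(0.0, 1.0)),
  (Array(2.0), Array(2.0))
))

val metrics = new MultilabelMetrics(scoreAndLabels)
println(s"Micro F1: ${metrics.microF1Measure}")
println(s"Hamming loss: ${metrics.hammingLoss}")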
Thanks Rohit, Roddick and Shreya. I tried
changing spark.yarn.executor.memoryOverhead to 10 GB and lowering
executor memory to 30 GB, and neither of these worked. I finally had to
reduce the number of cores per executor to 18 (from 36), in addition to
setting a higher
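For reference, a sketch of how settings like these are passed when building
the session, assuming Spark 2.x; the values echo the ones above, and the
overhead key is the pre-2.3 YARN spelling, which takes megabytes:

import org.apache.spark.sql.SparkSession

// Illustrative only; on YARN these keys must be in place before the
// application starts (e.g. via spark-submit --conf).
val spark = SparkSession.builder()
  .appName("tuned-job")
  .config("spark.executor.memory", "30g")
  .config("spark.yarn.executor.memoryOverhead", "10240") // MB
  .config("spark.executor.cores", "18")
  .getOrCreate()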
Hi,
I have run into a weird caching problem (using Spark 1.3.1 + Java 1.8.0) that I
can only explain as a bug.
In summary, I source the RDD from an Avro file, apply a mapToPair function,
count, and cache. However, the RDD is not cached, nor does it appear in the
Spark UI's Storage tab. (This is not cached at
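As an aside, the ordering that usually trips people up, as a minimal sketch
in the Scala API (the input path is hypothetical):

// cache()/persist() only marks the RDD; the first action that runs *after*
// the mark is what materializes the data and makes it show up under Storage.
val pairs = sc.textFile("hdfs:///data/input.txt").map(line => (line, 1))
pairs.cache()   // mark for caching before any action
pairs.count()   // this action populates the cache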
This question seems to deserve an escalation from Stack Overflow:
http://stackoverflow.com/questions/40803969/spark-size-exceeds-integer-max-value-when-joining-2-large-dfs
Looks like an important limitation: no single Spark block may exceed
Integer.MAX_VALUE bytes (about 2 GB).
-kr, Gerard.
Meta/PS: What do you think would be the best way to escalate from SO?
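Not from this thread, but the usual workaround sketch is to raise the
partition count so no single shuffle block crosses the 2 GB line (dfA, dfB,
"key" and the value stand in for the question's DataFrames):

// More, smaller partitions keep each shuffle block under Integer.MAX_VALUE bytes.
spark.conf.set("spark.sql.shuffle.partitions", "4000")
val joined = dfA.join(dfB, Seq("key"))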
I provided an answer to a similar question here:
https://www.mail-archive.com/user@spark.apache.org/msg57697.html
---
Robin East
Spark GraphX in Action Michael Malak and Robin East
Manning Publications Co.
Hi,
Interesting, but I personally would opt for withColumn, since it'd be
less to type (and also consistent with ticks (')), as follows:
df.withColumn("arrayItem", explode('myArray))
(Spark SQL has made my SQL developer's life so easy these days :))
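For comparison, a minimal sketch of both spellings (column names are
hypothetical; assumes the session's implicits are in scope for the ' syntax):

import org.apache.spark.sql.functions.explode
import spark.implicits._

df.select(explode('myArray) as 'arrayItem)      // keeps only the exploded column
df.withColumn("arrayItem", explode('myArray))   // appends it alongside the rest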
Best regards,
Jacek Laskowski