[ANNOUNCE] Announcing Spark 1.5.1

2015-10-01 Thread Reynold Xin
Hi All, Spark 1.5.1 is a maintenance release containing stability fixes. This release is based on the branch-1.5 maintenance branch of Spark. We *strongly recommend* that all 1.5.0 users upgrade to this release. The full list of bug fixes is here: http://s.apache.org/spark-1.5.1 http://spark.apach

Re: Null Value in DecimalType column of DataFrame

2015-09-21 Thread Reynold Xin
+dev list Hi Dirceu, The answer to whether throwing an exception or returning null is better depends on your use case. If you are debugging and want to find bugs with your program, you might prefer throwing an exception. However, if you are running on a large real-world dataset (i.e. data is dirt
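
A minimal sketch of the null-on-dirty-data behavior under discussion, assuming a Spark 1.5 shell with sc and sqlContext in scope (column names are hypothetical):

    import org.apache.spark.sql.functions.col
    import org.apache.spark.sql.types.DecimalType
    import sqlContext.implicits._

    // Values that fail the decimal cast become null instead of throwing;
    // filter them out explicitly if that is what the job needs.
    val df = Seq(("a", "12.34"), ("b", "not-a-number")).toDF("id", "amount")
    val parsed = df.withColumn("dec", col("amount").cast(DecimalType(38, 10)))
    parsed.filter(col("dec").isNotNull).show()  // keeps only parseable rows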

Re: in joins, does one side stream?

2015-09-20 Thread Reynold Xin
;> >> they dont seem specific to structured data analysis to me. >> >> On Sun, Sep 20, 2015 at 2:41 AM, Rishitesh Mishra < >> rishi80.mis...@gmail.com> wrote: >> >>> Got it..thnx Reynold.. >>> On 20 Sep 2015 07:08, "Reynold Xin"

Re: in joins, does one side stream?

2015-09-19 Thread Reynold Xin
aborate on this. I thought RDD also opens only an > iterator. Does it get materialized for joins? > > Rishi > > On Saturday, September 19, 2015, Reynold Xin wrote: > >> Yes for RDD -- both are materialized. No for DataFrame/SQL - one side >> streams. >> >>

Re: in joins, does one side stream?

2015-09-18 Thread Reynold Xin
Yes for RDD -- both are materialized. No for DataFrame/SQL - one side streams. On Thu, Sep 17, 2015 at 11:21 AM, Koert Kuipers wrote: > in scalding we join with the smaller side on the left, since the smaller > side will get buffered while the bigger side streams through the join. > > looking a
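
For readers who want to choose the buffered side explicitly, Spark 1.5 added a broadcast hint; a sketch assuming a shell with sc and sqlContext in scope (table shapes are illustrative):

    import org.apache.spark.sql.functions.broadcast
    import sqlContext.implicits._

    val large = sc.parallelize(1 to 1000000).map(i => (i % 100, i)).toDF("key", "v")
    val small = sc.parallelize(0 until 100).map(i => (i, s"name$i")).toDF("key", "name")

    // broadcast() marks the small side to be buffered (shipped to every
    // executor), so the large side streams through the join.
    val joined = large.join(broadcast(small), "key")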

Re: How to avoid shuffle errors for a large join ?

2015-09-16 Thread Reynold Xin
Only SQL and DataFrame for now. We are thinking about how to apply that to a more general distributed collection based API, but it's not in 1.5. On Sat, Sep 5, 2015 at 11:56 AM, Gurvinder Singh wrote: > On 09/05/2015 11:22 AM, Reynold Xin wrote: > > Try increase the shuffle memor

Re: Perf impact of BlockManager byte[] copies

2015-09-10 Thread Reynold Xin
This is one problem I'd like to address soon - providing a binary block management interface for shuffle (and maybe other things) that avoids serialization/copying. On Fri, Feb 27, 2015 at 3:39 PM, Paul Wais wrote: > Dear List, > > I'm investigating some problems related to native code integrat

Re: Driver OOM after upgrading to 1.5

2015-09-09 Thread Reynold Xin
Sandy Ryza wrote: > Java 7. > > FWIW I was just able to get it to work by increasing MaxPermSize to 256m. > > -Sandy > > On Wed, Sep 9, 2015 at 11:37 AM, Reynold Xin wrote: > >> Java 7 / 8? >> >> On Wed, Sep 9, 2015 at 10:10 AM, Sandy Ryza >> wro
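
For readers hitting the same PermGen ceiling on Java 7, a sketch of applying the MaxPermSize bump to the driver (the value is illustrative; tune to your workload):

    # on the command line
    spark-submit --driver-java-options "-XX:MaxPermSize=256m" ...

    # or persistently, in conf/spark-defaults.conf
    spark.driver.extraJavaOptions  -XX:MaxPermSize=256m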

Re: Driver OOM after upgrading to 1.5

2015-09-09 Thread Reynold Xin
Java 7 / 8? On Wed, Sep 9, 2015 at 10:10 AM, Sandy Ryza wrote: > I just upgraded the spark-timeseries > project to run on top of > 1.5, and I'm noticing that tests are failing with OOMEs. > > I ran a jmap -histo on the process and discovered the top

[ANNOUNCE] Announcing Spark 1.5.0

2015-09-09 Thread Reynold Xin
Hi All, Spark 1.5.0 is the sixth release on the 1.x line. This release represents 1400+ patches from 230+ contributors and 80+ institutions. To download Spark 1.5.0 visit the downloads page. A huge thanks goes to all of the individuals and organizations involved in development and testing of this r

Re: Best way to import data from Oracle to Spark?

2015-09-09 Thread Reynold Xin
Using the JDBC data source is probably the best way. http://spark.apache.org/docs/1.4.1/sql-programming-guide.html#jdbc-to-other-databases On Tue, Sep 8, 2015 at 10:11 AM, Cui Lin wrote: > What's the best way to import data from Oracle to Spark? Thanks! > > > -- > Best regards! > > Lin,Cui >

Re: Problems with Tungsten in Spark 1.5.0-rc2

2015-09-07 Thread Reynold Xin
On Wed, Sep 2, 2015 at 12:03 AM, Anders Arpteg wrote: > > BTW, is it possible (or will it be) to use Tungsten with dynamic > allocation and the external shuffle manager? > > Yes - I think this already works. There isn't anything specific here related to Tungsten.

Re: How to avoid shuffle errors for a large join ?

2015-09-05 Thread Reynold Xin
it takes about 12h to finish (with > 1 shuffle partitions). My hunch is that the reason for that is this: > > INFO ExternalSorter: Thread 3733 spilling in-memory map of 174.9 MB to > disk (62 times so far) > > (and lots more where this comes from). > > On Sat, Aug 29, 2

Re: How to avoid shuffle errors for a large join ?

2015-08-29 Thread Reynold Xin
Can you try 1.5? This should work much, much better in 1.5 out of the box. For 1.4, I think you'd want to turn on sort-merge-join, which is off by default. However, the sort-merge join in 1.4 can still trigger a lot of garbage, making it slower. SMJ performance is probably 5x - 1000x better in 1.5
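
A sketch of flipping that 1.4 switch (the key spark.sql.planner.sortMergeJoin is believed to control this in 1.4 -- verify against your release):

    // Spark 1.4: enable sort-merge join, which is off by default in that release.
    sqlContext.setConf("spark.sql.planner.sortMergeJoin", "true")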

Re: DataFrame. SparkPlan / Project serialization issue: ArrayIndexOutOfBounds.

2015-08-21 Thread Reynold Xin
You've probably hit this bug: https://issues.apache.org/jira/browse/SPARK-7180 It's fixed in Spark 1.4.1+. Try setting spark.serializer.extraDebugInfo to false and see if it goes away. On Fri, Aug 21, 2015 at 3:37 AM, Eugene Morozov wrote: > Hi, > > I'm using spark 1.3.1 built against hadoop 1
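
A sketch of the suggested workaround, assuming the SparkConf is built before the context is created:

    import org.apache.spark.SparkConf

    // Workaround for SPARK-7180 on pre-1.4.1 builds: turn off the extra
    // serialization debug info that triggers the bug.
    val conf = new SparkConf().set("spark.serializer.extraDebugInfo", "false")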

Re: Tungsten and sun.misc.Unsafe

2015-08-21 Thread Reynold Xin
I'm actually somewhat involved with the Google Docs you linked to. I don't think Oracle will remove Unsafe in JVM 9. As you said, JEP 260 already proposes making Unsafe available. Given the widespread use of Unsafe for performance and advanced functionalities, I don't think Oracle can just remove

Re: Memory allocation error with Spark 1.5

2015-08-05 Thread Reynold Xin
In Spark 1.5, we have a new way to manage memory (part of Project Tungsten). The default unit of memory allocation is 64MB, which is way too high when you have 1G of memory allocated in total and have more than 4 threads. We will reduce the default page size before releasing 1.5. For now, you can
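
For anyone who wants to lower the page size before 1.5 ships with a smaller default, the knob in the 1.5 preview builds was, to the best of my knowledge, spark.buffer.pageSize -- treat the name as an assumption and verify against your build:

    // Assumed 1.5-preview config name -- verify against your release notes.
    val conf = new org.apache.spark.SparkConf()
      .set("spark.buffer.pageSize", "16m")  // shrink the 64MB default allocation unit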

Re: Grouping runs of elements in a RDD

2015-06-30 Thread Reynold Xin
Try mapPartitions, which gives you an iterator, and you can produce an iterator back. On Tue, Jun 30, 2015 at 11:01 AM, RJ Nowling wrote: > Hi all, > > I have a problem where I have a RDD of elements: > > Item1 Item2 Item3 Item4 Item5 Item6 ... > > and I want to run a function over them to deci
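
A minimal sketch of the mapPartitions pattern suggested above, collapsing consecutive runs of equal elements into (value, runLength) pairs. Note it only sees runs within a partition, so runs spanning partition boundaries need a separate stitching step:

    val rdd = sc.parallelize(Seq(1, 1, 2, 2, 2, 3, 1, 1), 2)

    // One iterator in, one iterator out. This sketch buffers each
    // partition's runs in a list for simplicity.
    val runs = rdd.mapPartitions { iter =>
      iter.foldLeft(List.empty[(Int, Int)]) {
        case ((v, n) :: tail, x) if v == x => (v, n + 1) :: tail
        case (acc, x)                      => (x, 1) :: acc
      }.reverse.iterator
    }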

Re: Building scaladoc using "build/sbt unidoc" failure

2015-06-12 Thread Reynold Xin
Try build/sbt clean first. On Tue, May 26, 2015 at 4:45 PM, Justin Yip wrote: > Hello, > > I am trying to build scala doc from the 1.4 branch. But it failed due to > [error] (sql/compile:compile) java.lang.AssertionError: assertion failed: > List(object package$DebugNode, object package$DebugNo

Re: Exception when using CLUSTER BY or ORDER BY

2015-06-12 Thread Reynold Xin
Tom, Can you file a JIRA and attach a small reproducible test case if possible? On Tue, May 19, 2015 at 1:50 PM, Thomas Dudziak wrote: > Under certain circumstances that I haven't yet been able to isolate, I get > the following error when doing a HQL query using HiveContext (Spark 1.3.1 > on M

Re: Why is RDD to PairRDDFunctions only via implicits?

2015-05-22 Thread Reynold Xin
I'm not sure it is possible to overload the map function twice: once for just KV pairs, and once for K and V separately. On Fri, May 22, 2015 at 10:26 AM, Justin Pihony wrote: > This ticket improved > the RDD API, but it could be even mor

Re: rdd.sample() methods very slow

2015-05-21 Thread Reynold Xin
You can do something like this (with import org.apache.spark.rdd.PartitionPruningRDD and import scala.util.Random in scope): val myRdd = ... val rddSampledByPartition = PartitionPruningRDD.create(myRdd, i => Random.nextDouble() < 0.1) // this samples ~10% of the partitions rddSampledByPartition.mapPartitions { iter => iter.take(10) } // take the first 10 elements out of each partition

Re: DataFrame Column Alias problem

2015-05-21 Thread Reynold Xin
In 1.4 it actually shows col1 by default. In 1.3, you can add "col1" to the output, i.e. df.groupBy($"col1").agg($"col1", count($"col1").as("c")).show() On Thu, May 21, 2015 at 11:22 PM, SLiZn Liu wrote: > However this returns a single column of c, without showing the original > col1. > ​ > >

Re: Is the AMP lab done next February?

2015-05-11 Thread Reynold Xin
Relaying an answer from AMP director Mike Franklin: "One year into the lab we got a 5 yr Expeditions in Computing Award as part of the White House Big Data initiative in 2012, so we extended the lab for a year. We intend to start winding it down at the end of 2016, while supporting existing projec

Re: [SparkSQL 1.4.0] groupBy columns are always nullable?

2015-05-11 Thread Reynold Xin
as nullable (or not) depending on the input expression's schema ? > > Regards, > > Olivier. > Le lun. 11 mai 2015 à 22:07, Reynold Xin a écrit : > >> Not by design. Would you be interested in submitting a pull request? >> >> On Mon, May 11, 2015 at 1:48 AM, H

Re: [SparkSQL 1.4.0] groupBy columns are always nullable?

2015-05-11 Thread Reynold Xin
Not by design. Would you be interested in submitting a pull request? On Mon, May 11, 2015 at 1:48 AM, Haopu Wang wrote: > I try to get the result schema of aggregate functions using DataFrame > API. > > However, I find the result field of groupBy columns are always nullable > even the source fie

Re: large volume spark job spends most of the time in AppendOnlyMap.changeValue

2015-05-11 Thread Reynold Xin
Looks like it is spending a lot of time doing hash probing. It could be one or more of the following: 1. hash probing itself is inherently expensive compared with the rest of your workload 2. murmur3 doesn't work well with this key distribution 3. quadratic probing (triangular sequence) with a power-of

[ANNOUNCE] Ending Java 6 support in Spark 1.5 (Sep 2015)

2015-05-05 Thread Reynold Xin
Hi all, We will drop support for Java 6 starting with Spark 1.5, tentatively scheduled to be released in Sep 2015. Spark 1.4, scheduled to be released in June 2015, will be the last minor release that supports Java 6. That is to say: Spark 1.4.x (~ Jun 2015): will work with Java 6, 7, 8. Spark 1.5+ (~

Re: How to distribute Spark computation recipes

2015-04-27 Thread Reynold Xin
The code itself is the "recipe", no? On Mon, Apr 27, 2015 at 2:49 AM, Olivier Girardot < o.girar...@lateral-thoughts.com> wrote: > Hi everyone, > I know that any RDD is related to its SparkContext and the associated > variables (broadcast, accumulators), but I'm looking for a way to > ser

Re: Updating a Column in a DataFrame

2015-04-21 Thread Reynold Xin
You can use df.withColumn("a", df.b) to make column a have the same value as column b. On Mon, Apr 20, 2015 at 3:38 PM, ARose wrote: > In my Java application, I want to update the values of a Column in a given > DataFrame. However, I realize DataFrames are immutable, and therefore > cannot

Re: Column renaming after DataFrame.groupBy

2015-04-21 Thread Reynold Xin
You can use the more verbose syntax: d.groupBy("_1").agg(d("_1"), sum("_1").as("sum_1"), sum("_2").as("sum_2")) On Tue, Apr 21, 2015 at 1:06 AM, Justin Yip wrote: > Hello, > > I would like to rename a column after aggregation. In the following code, the > column name is "SUM(_1#179)", is there a w

Re: how to make a spark cluster ?

2015-04-20 Thread Reynold Xin
Actually if you only have one machine, just use the Spark local mode. Just download the Spark tarball, untar it, set master to local[N], where N = number of cores. You are good to go. There is no setup of job tracker or Hadoop. On Mon, Apr 20, 2015 at 3:21 PM, haihar nahak wrote: > Thank you :
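
A sketch of that setup, assuming nothing but a Spark tarball on the machine (app name illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    // local[4] runs Spark in-process with 4 worker threads --
    // no cluster manager, no job tracker, no Hadoop install.
    val conf = new SparkConf().setMaster("local[4]").setAppName("local-test")
    val sc = new SparkContext(conf)
    println(sc.parallelize(1 to 100).sum())  // quick smoke test
    sc.stop()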

Re: dataframe can not find fields after loading from hive

2015-04-17 Thread Reynold Xin
This is strange. cc the dev list since it might be a bug. On Thu, Apr 16, 2015 at 3:18 PM, Cesar Flores wrote: > Never mind. I found the solution: > > val newDataFrame = hc.createDataFrame(hiveLoadedDataFrame.rdd, > hiveLoadedDataFrame.schema) > > which translate to convert the data frame to r

Re: Why does the HDFS parquet file generated by Spark SQL have different size with those on Tachyon?

2015-04-17 Thread Reynold Xin
It's because you did a repartition -- which rearranges all the data. Parquet uses all kinds of compression techniques such as dictionary encoding and run-length encoding, which would result in the size difference when the data is ordered differently. On Fri, Apr 17, 2015 at 4:51 AM, zhangxiongfei

Re: [Spark1.3] UDF registration issue

2015-04-13 Thread Reynold Xin
You can do this: val strLen = udf((s: String) => s.length()) cleanProcessDF.withColumn("dii", strLen(col("di"))) (You might need to play with the type signature a little bit to get it to compile) On Fri, Apr 10, 2015 at 11:30 AM, Yana Kadiyska wrote: > Hi, I'm running into some trouble trying to r

Re: Expected behavior for DataFrame.unionAll

2015-04-13 Thread Reynold Xin
I think what happened was the application of the narrowest possible type. Type widening is required, and as a result, the narrowest common type between a string and an int is string. https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/HiveTypeCoercion.scal

Manning looking for a co-author for the GraphX in Action book

2015-04-13 Thread Reynold Xin
Hi all, Manning (the publisher) is looking for a co-author for the GraphX in Action book. The book currently has one author (Michael Malak), but they are looking for a co-author to work closely with Michael and improve the writing and make it more consumable. Early access page for the book: http

Re: ArrayBuffer within a DataFrame

2015-04-03 Thread Reynold Xin
There is already an explode function on DataFrame btw https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala#L712 I think something like this would work. You might need to play with the type. df.explode("arrayBufferColumn") { x => x } On Fri,

Re: Can I call aggregate UDF in DataFrame?

2015-04-01 Thread Reynold Xin
You totally can. https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala#L792 There is also an attempt at adding stddev here already: https://github.com/apache/spark/pull/5228 On Thu, Mar 26, 2015 at 12:37 AM, Haopu Wang wrote: > Specifically

Re: Build fails on 1.3 Branch

2015-03-29 Thread Reynold Xin
I pushed a hotfix to the branch. Should work now. On Sun, Mar 29, 2015 at 9:23 AM, Marty Bower wrote: > Yes, that worked - thank you very much. > > > > On Sun, Mar 29, 2015 at 9:05 AM Ted Yu wrote: > >> Jenkins build failed too: >> >> >> https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Sp

Re: spark disk-to-disk

2015-03-23 Thread Reynold Xin
means sc.objectFile > should never split files on reading (a feature of hadoop file inputformat > that gets in the way here). > > On Mon, Mar 23, 2015 at 1:39 PM, Koert Kuipers wrote: > >> i just realized the major limitation is that i lose partitioning info... >> >> On Mo

Re: spark disk-to-disk

2015-03-22 Thread Reynold Xin
On Sun, Mar 22, 2015 at 6:03 PM, Koert Kuipers wrote: > so finally i can resort to: > rdd.saveAsObjectFile(...) > sc.objectFile(...) > but that seems like a rather broken abstraction. > > This seems like a fine solution to me.

Re: SchemaRDD: SQL Queries vs Language Integrated Queries

2015-03-10 Thread Reynold Xin
They should have the same performance, as they are compiled down to the same execution plan. Note that starting in Spark 1.3, SchemaRDD is renamed DataFrame: https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html On Tue, Mar 10, 2015 at 2:13 PM

Help vote for Spark talks at the Hadoop Summit

2015-02-24 Thread Reynold Xin
Hi all, The Hadoop Summit uses community choice voting to decide which talks to feature. It would be great if the community could help vote for Spark talks so that Spark has a good showing at this event. You can make three votes on each track. Below I've listed 3 talks that are important to Spark'

Re: Spark 1.3 dataframe documentation

2015-02-24 Thread Reynold Xin
The official documentation will be posted when 1.3 is released (early March). Right now, you can build the docs yourself by running "jekyll build" in docs. Alternatively, just look at dataframe.py as Ted pointed out. On Tue, Feb 24, 2015 at 6:56 AM, Ted Yu wrote: > Have you looked at python/py

Re: New guide on how to write a Spark job in Clojure

2015-02-24 Thread Reynold Xin
Thanks for sharing, Chris. On Tue, Feb 24, 2015 at 4:39 AM, Christian Betz < christian.b...@performance-media.de> wrote: > Hi all, > > Maybe some of you are interested: I wrote a new guide on how to start > using Spark from Clojure. The tutorial covers > >- setting up a project, >- doin

Re: How to retreive the value from sql.row by column name

2015-02-16 Thread Reynold Xin
BTW we merged this today: https://github.com/apache/spark/pull/4640 This should allow us in the future to address column by name in a Row. On Mon, Feb 16, 2015 at 11:39 AM, Michael Armbrust wrote: > I can unpack the code snippet a bit: > > caper.select('ran_id) is the same as saying "SELECT ra
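
Once that change ships, name-based access looks roughly like this -- a sketch; which release first exposed getAs by field name is an assumption to verify, and the column name is hypothetical:

    import org.apache.spark.sql.Row

    // Look a column up by name instead of ordinal position.
    def ranId(row: Row): String = row.getAs[String]("ran_id")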

Re: Spark ML pipeline

2015-02-11 Thread Reynold Xin
Yes. Next release (Spark 1.3) is coming out end of Feb / early Mar. On Wed, Feb 11, 2015 at 7:22 AM, Jianguo Li wrote: > Hi, > > I really like the pipeline in the spark.ml in Spark1.2 release. Will > there be more machine learning algorithms implemented for the pipeline > framework in the next m

Re: Which version to use for shuffle service if I'm going to run multiple versions of Spark

2015-02-10 Thread Reynold Xin
I think we made the binary protocol compatible across all versions, so you should be fine with using any one of them. 1.2.1 is probably the best since it is the most recent stable release. On Tue, Feb 10, 2015 at 8:43 PM, Jianshi Huang wrote: > Hi, > > I need to use branch-1.2 and sometimes mast

Re: 2GB limit for partitions?

2015-02-03 Thread Reynold Xin
cc dev list How are you saving the data? There are two relevant 2GB limits: 1. Caching 2. Shuffle For caching, a partition is turned into a single block. For shuffle, each map partition is partitioned into R blocks, where R = number of reduce tasks. It is unlikely a shuffle block > 2G, altho
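
Until the limit is lifted, the practical workaround is keeping individual blocks small by raising the partition count; a sketch with hypothetical names and counts:

    // More partitions => smaller blocks; keep each cached or shuffle
    // block comfortably under 2GB.
    val repartitioned = bigRdd.repartition(4000)

    // For Spark SQL shuffles, raise the reducer count as well.
    sqlContext.setConf("spark.sql.shuffle.partitions", "4000")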

Re: How to access OpenHashSet in my standalone program?

2015-01-14 Thread Reynold Xin
s, I can incorporate it to my package and > use it. But I am still wondering why you designed such useful > functions as private. > > On Tue, Jan 13, 2015 at 3:33 PM, Reynold Xin wrote: > > It is not meant to be a public API. If you want to use it, maybe copy the > > code out of th

Re: How to access OpenHashSet in my standalone program?

2015-01-13 Thread Reynold Xin
It is not meant to be a public API. If you want to use it, maybe copy the code out of the package and put it in your own project. On Fri, Jan 9, 2015 at 7:19 AM, Tae-Hyuk Ahn wrote: > Hi, > > I would like to use OpenHashSet > (org.apache.spark.util.collection.OpenHashSet) in my standalone progra

Re: Creating RDD from only few columns of a Parquet file

2015-01-13 Thread Reynold Xin
What query did you run? Parquet should have predicate and column pushdown, i.e. if your query only needs to read 3 columns, then only 3 will be read. On Mon, Jan 12, 2015 at 10:20 PM, Ajay Srivastava < a_k_srivast...@yahoo.com.invalid> wrote: > Hi, > I am trying to read a parquet file using - > >
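
A sketch of column pruning in action (1.3-era API; the path and column names are hypothetical):

    // Only the three referenced columns are read from the Parquet file;
    // the others are skipped entirely by the column pushdown.
    val df = sqlContext.parquetFile("/data/events.parquet")
      .select("user_id", "ts", "url")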

Re: saveAsTextFile just uses toString and Row@37f108

2015-01-13 Thread Reynold Xin
It is just calling RDD's saveAsTextFile. I guess we should really override the saveAsTextFile in SchemaRDD (or make Row.toString comma separated). Do you mind filing a JIRA ticket and copy me? On Tue, Jan 13, 2015 at 12:03 AM, Kevin Burton wrote: > This is almost funny. > > I want to dump a co
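
Until that changes, a one-line workaround is formatting each Row yourself before saving; a sketch with hypothetical names:

    // Row.mkString joins the row's values with the separator, giving a
    // simple CSV-style dump (no quoting or escaping of embedded commas).
    schemaRDD.map(_.mkString(",")).saveAsTextFile("/tmp/out")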

Re: Spark on teradata?

2015-01-08 Thread Reynold Xin
Depending on your use cases. If the use case is to extract small amount of data out of teradata, then you can use the JdbcRDD and soon a jdbc input source based on the new Spark SQL external data source API. On Wed, Jan 7, 2015 at 7:14 AM, gen tang wrote: > Hi, > > I have a stupid question: >
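
A sketch of the JdbcRDD route (connection URL, table, and bounds are illustrative; the query must contain two ? placeholders that JdbcRDD fills with per-partition bounds):

    import java.sql.DriverManager
    import org.apache.spark.rdd.JdbcRDD

    val rows = new JdbcRDD(
      sc,
      () => DriverManager.getConnection("jdbc:teradata://host/db", "user", "pass"),
      "SELECT id, name FROM customers WHERE id >= ? AND id <= ?",
      lowerBound = 1L, upperBound = 1000000L, numPartitions = 10,
      mapRow = rs => (rs.getLong(1), rs.getString(2)))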

Re: Confused why I'm losing workers/executors when writing a large file to S3

2014-11-13 Thread Reynold Xin
Darin, You might want to increase these config options also: spark.akka.timeout 300 spark.storage.blockManagerSlaveTimeoutMs 30 On Thu, Nov 13, 2014 at 11:31 AM, Darin McBeath wrote: > For one of my Spark jobs, my workers/executors are dying and leaving the > cluster. > > On the master, I

Re: Breaking the previous large-scale sort record with Spark

2014-11-05 Thread Reynold Xin
at > http://databricks.com/blog/2014/10/10/spark-breaks-previous-large-scale-sort-record.html. > Summary: while Hadoop MapReduce held last year's 100 TB world record by > sorting 100 TB in 72 minutes on 2100 nodes, we sorted it in 23 minutes on > 206 nodes; and we also scaled up to sort

Re: OOM with groupBy + saveAsTextFile

2014-11-01 Thread Reynold Xin
None of your tuning will help here because the problem is actually the way you are saving the output. If you take a look at the stacktrace, it is trying to build a single string that is too large for the VM to allocate memory. The VM is actually not running out of memory, but rather, JVM cannot sup

Re: something about rdd.collect

2014-10-14 Thread Reynold Xin
Hi Randy, collect essentially transfers all the data to the driver node. You definitely wouldn’t want to collect 200 million words. It is a pretty large number and you can run out of memory on your driver with that much data. --  Reynold Xin On October 14, 2014 at 9:26:13 PM, randylu (randyl
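
When only a peek at the data is needed, bounded alternatives keep the driver safe; a sketch with a hypothetical RDD named words:

    val preview = words.take(100)    // pulls back just 100 elements
    val it = words.toLocalIterator   // streams one partition at a time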

Re: SQL queries fail in 1.2.0-SNAPSHOT

2014-09-29 Thread Reynold Xin
Hi Daoyuan, Do you mind applying this patch and look at the exception again? https://github.com/apache/spark/pull/2580 It has also been merged in master so if you pull from master, you should have that. On Mon, Sep 29, 2014 at 1:17 AM, Wang, Daoyuan wrote: > Hi all, > > > > I had some of m

Re: driver memory management

2014-09-28 Thread Reynold Xin
The storage fraction only limits the amount of memory used for storage. It doesn't actually limit anything else. I.e., you can use all the memory if you want in collect. On Sunday, September 28, 2014, Brad Miller wrote: > Hi All, > > I am interested to collect() a large RDD so that I can run a lea

Spark meetup on Oct 15 in NYC

2014-09-28 Thread Reynold Xin
Hi Spark users and developers, Some of the most active Spark developers (including Matei Zaharia, Michael Armbrust, Joseph Bradley, TD, Paco Nathan, and me) will be in NYC for Strata NYC. We are working with the Spark NYC meetup group and Bloomberg to host a meetup event. This might be the event w

Re: collect on hadoopFile RDD returns wrong results

2014-09-18 Thread Reynold Xin
This is due to the HadoopRDD (and also the underlying Hadoop InputFormat) reusing objects to avoid allocation. It is sort of tricky to fix. However, in most cases you can clone the records to make sure you are not collecting the same object over and over again. https://issues.apache.org/jira/browse/
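
A sketch of the cloning workaround, assuming a plain text file read via hadoopFile with the old mapred API (the path is hypothetical):

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapred.TextInputFormat

    // hadoopFile hands back *reused* Writable instances; copy them into
    // immutable values before collecting, or every element ends up
    // aliasing the same object.
    val lines = sc
      .hadoopFile[LongWritable, Text, TextInputFormat]("/data/input")
      .map { case (k, v) => (k.get, v.toString) }
    lines.collect()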

Re: Powered By Spark: Can you please add our org?

2014-07-08 Thread Reynold Xin
I added you to the list. Cheers. On Mon, Jul 7, 2014 at 6:19 PM, Alex Gaudio wrote: > Hi, > > Sailthru is also using Spark. Could you please add us to the Powered By > Spark > page > when you have a chance? > > Organization

Re: Comparative study

2014-07-08 Thread Reynold Xin
Not sure exactly what is happening but perhaps there are ways to restructure your program for it to work better. Spark is definitely able to handle much, much larger workloads. I've personally run a workload that shuffled 300 TB of data. I've also run something that shuffled 5TB/node and stuffed m

openstack swift integration with Spark

2014-06-13 Thread Reynold Xin
If you are interested in openstack/swift integration with Spark, please drop me a line. We are looking into improving the integration. Thanks.

Re: Largest input data set observed for Spark.

2014-03-20 Thread Reynold Xin
d, > > How complex was that job (I guess in terms of number of transforms and > actions) and how long did that take to process? > > -Suren > > > > On Thu, Mar 20, 2014 at 2:08 PM, Reynold Xin wrote: > > > Actually we just ran a job with 70TB+ compressed data on
