Re: How to reflect dynamic registration udf?

2016-12-16 Thread Cheng Lian
Could you please provide more context about what you are trying to do here? On Thu, Dec 15, 2016 at 6:27 PM 李斌松 wrote: > How to reflect dynamic registration udf? > > java.lang.UnsupportedOperationException: Schema for type _$13 is not > supported > at >

Re: Writing to Parquet Job turns to wait mode after even completion of job

2016-10-24 Thread Cheng Lian
On 10/22/16 6:18 AM, Steve Loughran wrote: ... On Sat, Oct 22, 2016 at 3:41 AM, Cheng Lian <lian.cs@gmail.com> wrote: What version of Spark are you using and how many output files does the job write out? By default, Spark version

Re: [Spark 2.0.0] error when unioning to an empty dataset

2016-10-24 Thread Cheng Lian
lines? Exactly. On Fri, Oct 21, 2016 at 3:39 PM Cheng Lian <lian.cs@gmail.com> wrote: Efe - You probably hit this bug: https://issues.apache.org/jira/browse/SPARK-18058 On 10/21/16 2:03 AM, Agraj Mangal wrote: I have see

Re: RDD groupBy() then random sort each group ?

2016-10-21 Thread Cheng Lian
I think it would be much easier to use the DataFrame API to do this by doing a local sort using randn() as the key. For example, in Spark 2.0: val df = spark.range(100) val shuffled = df.repartition($"id" % 10).sortWithinPartitions(randn(42)) Replace df with a DataFrame wrapping your RDD, and $"id" % 10
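A runnable sketch of the approach described in the snippet above, assuming Spark 2.0 with a SparkSession named spark; the group count of 10 and the seed 42 are illustrative.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.randn

    val spark = SparkSession.builder().appName("random-sort-within-groups").getOrCreate()
    import spark.implicits._

    val df = spark.range(100).toDF("id")
    val shuffled = df
      .repartition($"id" % 10)           // co-locate rows sharing the same group key
      .sortWithinPartitions(randn(42))   // random order inside each partition
    shuffled.show()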

Re: [Spark 2.0.0] error when unioning to an empty dataset

2016-10-21 Thread Cheng Lian
Efe - You probably hit this bug: https://issues.apache.org/jira/browse/SPARK-18058 On 10/21/16 2:03 AM, Agraj Mangal wrote: I have seen this error sometimes when the elements in the schema have different nullabilities. Could you print the schema for data and for

Re: How to iterate the element of an array in DataFrame?

2016-10-21 Thread Cheng Lian
You may either use the SQL functions "array" and "named_struct" or define a case class with the expected field names. Cheng On 10/21/16 2:45 AM, 颜发才(Yan Facai) wrote: My expectation is: root |-- tag: vector. Namely, I want to extract from: [[tagCategory_060, 0.8], [tagCategory_029, 0.7]] to:
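A sketch of the SQL-function route mentioned above, assuming a SQLContext named sqlContext; the flat input columns (tag1, weight1, tag2, weight2) and the output field names are illustrative.

    val df = sqlContext.createDataFrame(Seq(
      ("tagCategory_060", 0.8, "tagCategory_029", 0.7)
    )).toDF("tag1", "weight1", "tag2", "weight2")

    // Each named_struct builds one (tag, weight) struct; array collects them into a single column.
    val result = df.selectExpr(
      "array(named_struct('tag', tag1, 'weight', weight1), " +
      "named_struct('tag', tag2, 'weight', weight2)) AS tags")
    result.printSchema()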

Re: Dataframe schema...

2016-10-21 Thread Cheng Lian
Yea, confirmed. While analyzing unions, we treat StructTypes with different field nullabilities as incompatible types and throw this error. Opened https://issues.apache.org/jira/browse/SPARK-18058 to track this issue. Thanks for reporting! Cheng On 10/21/16 3:15 PM, Cheng Lian wrote: Hi

Re: Dataframe schema...

2016-10-21 Thread Cheng Lian
Hi Muthu, Which version of Spark are you using? This seems to be a bug in the analysis phase. Cheng On 10/21/16 12:50 PM, Muthu Jayakumar wrote: Sorry for the late response. Here is what I am seeing... Schema from parquet file. d1.printSchema() root |-- task_id: string (nullable =

Re: Writing to Parquet Job turns to wait mode after even completion of job

2016-10-21 Thread Cheng Lian
What version of Spark are you using and how many output files does the job write out? By default, Spark versions before 1.6 (exclusive) write Parquet summary files when committing the job. This process reads footers from all Parquet files in the destination directory and merges them

Re: Where condition on columns of Arrays does no longer work in spark 2

2016-10-21 Thread Cheng Lian
Thanks for reporting! It's a bug, just filed a ticket to track it: https://issues.apache.org/jira/browse/SPARK-18053 Cheng On 10/20/16 1:54 AM, filthysocks wrote: I have a Column in a DataFrame that contains Arrays and I wanna filter for equality. It does work fine in spark 1.6 but not in

Re: Consuming parquet files built with version 1.8.1

2016-10-17 Thread Cheng Lian
Hi Dinesh, Thanks for reporting. This is kinda weird and I can't reproduce this. Were you doing the experiments using a cleanly compiled Spark master branch? And I don't think you have to use parquet-mr 1.8.1 to read Parquet files generated using parquet-mr 1.8.1 unless you are using something not

Re: Spark-2.0.0 fails reading a parquet dataset generated by Spark-1.6.2

2016-08-12 Thread Cheng Lian
OK, I've merged this PR to master and branch-2.0. On 8/11/16 8:27 AM, Cheng Lian wrote: Haven't figured out exactly how it failed, but the leading underscore in the partition directory name looks suspicious. Could you please try this PR to see whether it fixes the issue: https

Re: Spark-2.0.0 fails reading a parquet dataset generated by Spark-1.6.2

2016-08-10 Thread Cheng Lian
Haven't figured out exactly how it failed, but the leading underscore in the partition directory name looks suspicious. Could you please try this PR to see whether it fixes the issue: https://github.com/apache/spark/pull/14585/files Cheng On 8/9/16 5:38 PM, immerrr again wrote:

Re: Re: Bug about reading parquet files

2016-07-09 Thread Cheng Lian
According to our offline discussion, the target table consists of 1M+ small Parquet files (~12 MB on average). The OOM occurred on the driver side while listing input files. My theory is that the total size of all listed FileStatus objects is too large for the driver and caused the OOM.

Re: Bug about reading parquet files

2016-07-08 Thread Cheng Lian
What's the Spark version? Could you please also attach result of explain(extended = true)? On Fri, Jul 8, 2016 at 4:33 PM, Sea <261810...@qq.com> wrote: > I have a problem reading parquet files. > sql: > select count(1) from omega.dwd_native where year='2016' and month='07' > and day='05' and

Re: Hive 1.0.0 not able to read Spark 1.6.1 parquet output files on EMR 4.7.0

2016-06-15 Thread Cheng Lian
Spark 1.6.1 is also using 1.7.0. Could you please share the schema of your Parquet file as well as the exact exception stack trace reported by Hive? Cheng On 6/13/16 12:56 AM, mayankshete wrote: Hello Team , I am facing an issue where output files generated by Spark 1.6.1 are not read by

Re: update mysql in spark

2016-06-15 Thread Cheng Lian
Spark SQL doesn't support update command yet. On Wed, Jun 15, 2016, 9:08 AM spR wrote: > hi, > > can we write a update query using sqlcontext? > > sqlContext.sql("update act1 set loc = round(loc,4)") > > what is wrong in this? I get the following error. > >

Re: feedback on dataset api explode

2016-05-25 Thread Cheng Lian
Agree, since they can be easily replaced by .flatMap (to do explosion) and .select (to rename output columns) Cheng On 5/25/16 12:30 PM, Reynold Xin wrote: Based on this discussion I'm thinking we should deprecate the two explode functions. On Wednesday, May 25, 2016, Koert Kuipers
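A sketch of replacing explode with flatMap on a typed Dataset, as suggested above, assuming Spark 2.0; the case classes and field names are hypothetical.

    import org.apache.spark.sql.{Dataset, SparkSession}

    case class Post(id: Long, tags: Seq[String])
    case class PostTag(id: Long, tag: String)

    val spark = SparkSession.builder().appName("explode-via-flatMap").getOrCreate()
    import spark.implicits._

    val posts: Dataset[Post] = Seq(Post(1L, Seq("spark", "parquet"))).toDS()

    // flatMap performs the explosion; output columns are named by the case class fields.
    val exploded: Dataset[PostTag] = posts.flatMap(p => p.tags.map(t => PostTag(p.id, t)))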

Re: How to delete a record from parquet files using dataframes

2016-02-24 Thread Cheng Lian
Parquet is a read-only format. So the only way to remove data from a written Parquet file is to write a new Parquet file without unwanted rows. Cheng On 2/17/16 5:11 AM, SRK wrote: Hi, I am saving my records in the form of parquet files using dataframes in hdfs. How to delete the records
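A sketch of the rewrite-without-the-unwanted-rows approach; the paths and the filter predicate are purely illustrative, and sqlContext is assumed to be an existing SQLContext.

    // Read the existing data, keep only the rows you want, and write a new dataset.
    val df = sqlContext.read.parquet("hdfs:///data/records")
    val kept = df.filter("status != 'deleted'")        // drop the unwanted rows
    kept.write.parquet("hdfs:///data/records_cleaned") // swap directories afterwards, outside Spark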

Re: cast column string -> timestamp in Parquet file

2016-01-25 Thread Cheng Lian
The following snippet may help: sqlContext.read.parquet(path).withColumn("col_ts", $"col".cast(TimestampType)).drop("col") Cheng On 1/21/16 6:58 AM, Muthu Jayakumar wrote: DataFrame and udf. This may be more performant than doing an RDD transformation as you'll only transform just the
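An expanded form of the one-liner above with the imports it needs; the path and the column name "col" are placeholders.

    import org.apache.spark.sql.types.TimestampType
    import sqlContext.implicits._   // for the $"..." column syntax

    val df = sqlContext.read.parquet("/data/events")
    val withTs = df
      .withColumn("col_ts", $"col".cast(TimestampType)) // parse the string into a timestamp
      .drop("col")                                      // keep only the converted column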

Re: DataFrame partitionBy to a single Parquet file (per partition)

2016-01-15 Thread Cheng Lian
You may try DataFrame.repartition(partitionExprs: Column*) to shuffle all data belonging to a single (data) partition into a single (RDD) partition: df.coalesce(1).repartition("entity", "year", "month", "day", "status").write.partitionBy("entity", "year", "month", "day",
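A sketch of the idea in the snippet above, assuming Spark 1.6+ (where DataFrame.repartition accepts column expressions) and the same partition columns; because all rows sharing a partition-column combination are shuffled into one task, each output directory ends up with a single Parquet file.

    import org.apache.spark.sql.functions.col

    df.repartition(col("entity"), col("year"), col("month"), col("day"), col("status"))
      .write
      .partitionBy("entity", "year", "month", "day", "status")
      .parquet("/datasink/output-parquets")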

Re: parquet repartitions and parquet.enable.summary-metadata does not work

2016-01-12 Thread Cheng Lian
. Best, Gavin On Mon, Jan 11, 2016 at 4:31 PM, Cheng Lian <lian.cs@gmail.com> wrote: Hey Gavin, Could you please provide a snippet of your code to show how you disabled "parquet.enable.summary-metadata" and wrote th

Re: parquet repartitions and parquet.enable.summary-metadata does not work

2016-01-11 Thread Cheng Lian
Hey Gavin, Could you please provide a snippet of your code to show how you disabled "parquet.enable.summary-metadata" and wrote the files? Especially, you mentioned you saw "3000 jobs" fail. Were you writing each Parquet file with an individual job? (Usually people use

Re: memory leak when saving Parquet files in Spark

2015-12-14 Thread Cheng Lian
hanks, -Matt On Fri, Dec 11, 2015 at 1:58 AM, Cheng Lian <l...@databricks.com> wrote: This is probably caused by schema merging. Were you using Spark 1.4 or earlier versions? Could you please try the following snippet to see whether it he

Re: About the bottleneck of parquet file reading in Spark

2015-12-10 Thread Cheng Lian
Cc Spark user list since this information is generally useful. On Thu, Dec 10, 2015 at 3:31 PM, Lionheart <87249...@qq.com> wrote: > Dear, Cheng > I'm a user of Spark. Our current Spark version is 1.4.1 > In our project, I find there is a bottleneck when loading huge amount > of

Re: memory leak when saving Parquet files in Spark

2015-12-10 Thread Cheng Lian
This is probably caused by schema merging. Were you using Spark 1.4 or earlier versions? Could you please try the following snippet to see whether it helps: df.write .format("parquet") .option("mergeSchema", "false") .partitionBy(partitionCols: _*) .mode(saveMode) .save(targetPath)

Re: parquet file doubts

2015-12-08 Thread Cheng Lian
absi...@informatica.com wrote: Yes, Parquet has min/max. *From:* Cheng Lian [mailto:l...@databricks.com] *Sent:* Monday, December 07, 2015 11:21 AM *To:* Ted Yu *Cc:* Shushant Arora; user@spark.apache.org

Re: parquet file doubts

2015-12-06 Thread Cheng Lian
cc parquet-dev list (it would be nice to always do so for these general questions.) Cheng On 12/6/15 3:10 PM, Shushant Arora wrote: Hi I have a few doubts on the parquet file format. 1. Does parquet keep min/max statistics like ORC does? How can I see the parquet version (whether it's 1.1, 1.2 or 1.3) for

Re: parquet file doubts

2015-12-06 Thread Cheng Lian
, Ted Yu wrote: Cheng: I only see user@spark in the CC. FYI On Sun, Dec 6, 2015 at 8:01 PM, Cheng Lian <l...@databricks.com> wrote: cc parquet-dev list (it would be nice to always do so for these general questions.) Cheng On 12/

Re: df.partitionBy().parquet() java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-12-02 Thread Cheng Lian
You may try to set Hadoop conf "parquet.enable.summary-metadata" to false to disable writing Parquet summary files (_metadata and _common_metadata). By default Parquet writes the summary files by collecting footers of all part-files in the dataset while committing the job. Spark also follows
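One way to set the Hadoop option mentioned above from Spark code (a sketch; it can equally go into your Hadoop configuration files or be passed as --conf spark.hadoop.parquet.enable.summary-metadata=false):

    // Disable Parquet summary files (_metadata / _common_metadata) for subsequent writes.
    sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")

    df.write.partitionBy("date").parquet("/output/path")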

Re: DateTime Support - Hive Parquet

2015-11-29 Thread Cheng Lian
for this case? Do you convert on insert or on RDD to DF conversion? Regards, Bryan Jeffrey Sent from Outlook Mail *From:* Cheng Lian *Sent:* Tuesday, November 24, 2015 6:49 AM *To:* Bryan; user *Subject:* Re: DateTime Support - Hive Parquet I see, then this is actually irrelevant to Parquet. I

Re: Parquet files not getting coalesced to smaller number of files

2015-11-29 Thread Cheng Lian
RDD.coalesce(n) returns a new RDD rather than modifying the original RDD. So what you need is: metricsToBeSaved.coalesce(1500).saveAsNewAPIHadoopFile(...) Cheng On 11/29/15 12:21 PM, SRK wrote: Hi, I have the following code that saves the parquet files in my hourly batch to hdfs. My

Re: DateTime Support - Hive Parquet

2015-11-24 Thread Cheng Lian
(to nanos, Timestamp, etc) prior to writing records to hive. Regards, Bryan Jeffrey Sent from Outlook Mail *From:* Cheng Lian *Sent:* Tuesday, November 24, 2015 1:42 AM *To:* Bryan Jeffrey; user *Subject:* Re: DateTime Support - Hive Parquet Hey Bryan, What do you mean by "DateTime prope

Re: DateTime Support - Hive Parquet

2015-11-23 Thread Cheng Lian
Hey Bryan, What do you mean by "DateTime properties"? Hive and Spark SQL both support DATE and TIMESTAMP types, but there's no DATETIME type. So I assume you are referring to Java class DateTime (possibly the one in joda)? Could you please provide a sample snippet that illustrates your

Re: dounbts on parquet

2015-11-19 Thread Cheng Lian
ns where this rdd will land. Have you used multiple output formats in Spark? On Fri, Nov 13, 2015 at 3:56 PM, Cheng Lian <lian.cs@gmail.com> wrote: Oh I see. Then parquet-avro should probably be more useful. AFAIK, parquet-hive is only

Re: Unwanted SysOuts in Spark Parquet

2015-11-10 Thread Cheng Lian
This is because of PARQUET-369, which prevents users or other libraries from overriding Parquet's JUL logging settings via SLF4J. It has been fixed in the most recent parquet-format master (PR #32

Re: very slow parquet file write

2015-11-06 Thread Cheng Lian
I'd expect writing Parquet files to be slower than writing JSON files since Parquet involves more complicated encoders, but maybe not that slow. Would you mind trying to profile one Spark executor using tools like YJP to see what the hotspot is? Cheng On 11/6/15 7:34 AM, rok wrote: Apologies if

Re: very slow parquet file write

2015-11-06 Thread Cheng Lian
of your responses are there either. I am definitely subscribed to the list though (I get daily digests). Any clue how to fix it? Sorry, no idea :-/ On Nov 6, 2015, at 9:26 AM, Cheng Lian <lian.cs@gmail.com> wrote: I'd expect writing Parquet

Re: Issue of Hive parquet partitioned table schema mismatch

2015-11-04 Thread Cheng Lian
Is there any chance that "spark.sql.hive.convertMetastoreParquet" is turned off? Cheng On 11/4/15 5:15 PM, Rex Xiong wrote: Thanks Cheng Lian. I found in 1.5, if I use spark to create this table with partition discovery, the partition pruning can be performed, but for my

Re: Issue of Hive parquet partitioned table schema mismatch

2015-11-03 Thread Cheng Lian
SPARK-11153 should be irrelevant because you are filtering on a partition key while SPARK-11153 is about Parquet filter push-down and doesn't affect partition pruning. Cheng On 11/3/15 7:14 PM, Rex Xiong wrote: We found the query performance is very poor due to this issue

Re: Filter applied on merged Parquet shemsa with new column fails.

2015-10-28 Thread Cheng Lian
Hey Hyukjin, Sorry that I missed the JIRA ticket. Thanks for bringing this issue up here and for your detailed investigation. From my side, I think this is a bug in Parquet. Parquet was designed to support schema evolution. When scanning a Parquet file, if a column exists in the requested schema but

Re: Fixed writer version as version1 for Parquet as wring a Parquet file.

2015-10-09 Thread Cheng Lian
Hi Hyukjin, Thanks for bringing this up. Could you please make a PR for this one? We didn't use PARQUET_2_0 mostly because it's less mature than PARQUET_1_0, but we should let users choose the writer version, as long as PARQUET_1_0 remains the default option. Cheng On 10/8/15 11:04 PM,

Re: Parquet file size

2015-10-08 Thread Cheng Lian
*From:* odeach...@gmail.com [odeach...@gmail.com] on behalf of Deng Ching-Mallete [och...@apache.org] *Sent:* Wednesday, October 07, 2015 9:14 PM *To:* Younes Naguib *Cc:* Cheng Lian

Re: Parquet file size

2015-10-07 Thread Cheng Lian
, without month and day). Cheng So you want to dump all data into a single large Parquet file? On 10/7/15 1:55 PM, Younes Naguib wrote: The original TSV files are 600GB and generated 40k files of 15-25MB. y *From:* Cheng Lian [mailto:lian.cs@gmail.com] *Sent:* October-07-15 3:18 PM

Re: Parquet file size

2015-10-07 Thread Cheng Lian
Why do you want larger files? Doesn't the resulting Parquet file contain all the data in the original TSV file? Cheng On 10/7/15 11:07 AM, Younes Naguib wrote: Hi, I’m reading a large tsv file, and creating parquet files using sparksql: insert overwrite table tbl partition(year, month,

Re: Metadata in Parquet

2015-09-30 Thread Cheng Lian
Unfortunately this isn't supported at the moment https://issues.apache.org/jira/browse/SPARK-10803 Cheng On 9/30/15 10:54 AM, Philip Weaver wrote: Hi, I am using org.apache.spark.sql.types.Metadata to store extra information along with each of my fields. I'd also like to store Metadata for

Re: Performance when iterating over many parquet files

2015-09-28 Thread Cheng Lian
I guess you're probably using Spark 1.5? Spark SQL does support schema merging, but we disabled it by default since 1.5 because it introduces extra performance costs (it's turned on by default in 1.4 and 1.3). You may enable schema merging via either the Parquet data source specific option
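A sketch of the two ways to turn schema merging back on in 1.5; the path is illustrative and sqlContext is assumed to be an existing SQLContext.

    // 1. Per read, via the Parquet data source option
    val merged = sqlContext.read.option("mergeSchema", "true").parquet("/data/events")

    // 2. Globally, via the SQL configuration
    sqlContext.setConf("spark.sql.parquet.mergeSchema", "true")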

Re: Performance when iterating over many parquet files

2015-09-28 Thread Cheng Lian
Also, you may find more details in the programming guide: - http://spark.apache.org/docs/latest/sql-programming-guide.html#schema-merging - http://spark.apache.org/docs/latest/sql-programming-guide.html#configuration Cheng On 9/28/15 3:54 PM, Cheng Lian wrote: I guess you're probably using

Re: Performance when iterating over many parquet files

2015-09-28 Thread Cheng Lian
! The problem now is to filter out bad (miswritten) Parquet files, as they are causing this operation to fail. Any suggestions on detecting them quickly and easily? *From:* Cheng Lian [mailto:lian.cs@gmail.com] *Sent:* Monday, September 28, 2015 5:56 PM *To:* Thomas, Jordan <jordan.

Re: Performance when iterating over many parquet files

2015-09-28 Thread Cheng Lian
nd re-transferred. Thanks, Jordan *From:* Cheng Lian [mailto:lian.cs@gmail.com] *Sent:* Monday, September 28, 2015 6:15 PM *To:* Thomas, Jordan <jordan.tho...@accenture.com>; mich...@databricks.com *Cc:* user@spark.apache.org *Subject:* Re: Performance when iterating over many parquet

Re: Performance when iterating over many parquet files

2015-09-28 Thread Cheng Lian
g very similar this weekend. It works but is very slow. The Spark method I included in my original post is about 5-6 times faster. Just wondering if there is something even faster than that. I see this as being a recurring problem over the next few months. *From:* Cheng Lian [mailto:l

Re: Is this a Spark issue or Hive issue that Spark cannot read the string type data in the Parquet generated by Hive

2015-09-25 Thread Cheng Lian
Please set the SQL option spark.sql.parquet.binaryAsString to true when reading Parquet files containing strings generated by Hive. This is actually a bug in parquet-hive. When generating the Parquet schema for a string field, Parquet requires a "UTF8" annotation, something like: message
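A sketch of setting the option mentioned above before reading Hive-generated Parquet data; the table path is illustrative and sqlContext is assumed to be an existing SQLContext.

    // Treat Parquet BINARY columns lacking a UTF8 annotation as strings.
    sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")

    val df = sqlContext.read.parquet("/user/hive/warehouse/some_table")
    df.printSchema()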

Re: Is this a Spark issue or Hive issue that Spark cannot read the string type data in the Parquet generated by Hive

2015-09-25 Thread Cheng Lian
BTW, just checked that this bug should have been fixed since Hive 0.14.0. So the SQL option I mentioned is mostly used for reading legacy Parquet files generated by older versions of Hive. Cheng On 9/25/15 2:42 PM, Cheng Lian wrote: Please set the SQL option

Re: Using Map and Basic Operators yield java.lang.ClassCastException (Parquet + Hive + Spark SQL 1.5.0 + Thrift)

2015-09-25 Thread Cheng Lian
ndling INT is all good but float and double are causing the exception. Thanks. Dominic Ricard Triton Digital -Original Message- From: Cheng Lian [mailto:lian.cs@gmail.com] Sent: Thursday, September 24, 2015 5:47 PM To: Dominic Ricard; user@spark.apache.org Subject: Re: Using Map a

Re: spark + parquet + schema name and metadata

2015-09-24 Thread Cheng Lian
t out to see if this works ok. I am planning to use "stable" metadata - so those will be the same across all parquet files inside the directory hierarchy... On Tue, 22 Sep 2015 at 18:54 Cheng Lian <lian.cs@gmail.com> wrote: Mi

Re: spark + parquet + schema name and metadata

2015-09-22 Thread Cheng Lian
" them in some way (giving the schema an appropriate name or attaching some key/values) and then it is fairly easy to get basic metadata about parquet files when processing and discovering those later on. On Mon, 21 Sep 2015 at 18:17 Cheng Lian <lian.cs@gmail.com>

Re: spark + parquet + schema name and metadata

2015-09-21 Thread Cheng Lian
Currently Spark SQL doesn't support customizing schema name and metadata. May I know why these two matter in your use case? Some Parquet data models, like parquet-avro, do support it, while some others don't (e.g. parquet-hive). Cheng On 9/21/15 7:13 AM, Borisa Zivkovic wrote: Hi, I am

Re: parquet error

2015-09-18 Thread Cheng Lian
Not sure what's happening here, but I guess it's probably a dependency version issue. Could you please give vanilla Apache Spark a try to see whether it's a CDH-specific issue or not? Cheng On 9/17/15 11:44 PM, Chengi Liu wrote: Hi, I did some digging... I believe the error is caused by

Re: Spark-shell throws Hive error when SQLContext.parquetFile, v1.3

2015-09-10 Thread Cheng Lian
If you don't need to interact with Hive, you may compile Spark without the -Phive flag to eliminate Hive dependencies. In this way, the sqlContext instance in the Spark shell will be of type SQLContext instead of HiveContext. The Hive metastore error is probably due to

Re: How to read compressed parquet file

2015-09-09 Thread Cheng Lian
You need to use "har://" instead of "hdfs://" to read HAR files. Just tested against Spark 1.5, and it works as expected. Cheng On 9/9/15 3:29 PM, 李铖 wrote: I think too many parquet files may be affect reading capability,so I use hadoop archive to combine them,but

Re: Split content into multiple Parquet files

2015-09-08 Thread Cheng Lian
In Spark 1.4 and 1.5, you can do something like this: df.write.partitionBy("key").parquet("/datasink/output-parquets") BTW, I'm curious about how you did it without partitionBy using saveAsHadoopFile? Cheng On 9/8/15 2:34 PM, Adrien Mogenet wrote: Hi there, We've spent several hours to

Re: Parquet Array Support Broken?

2015-09-08 Thread Cheng Lian
f the file is created in Spark On Mon, Sep 7, 2015 at 3:06 PM, Ruslan Dautkhanov <dautkha...@gmail.com> wrote: Read response from Cheng Lian <lian.cs@gmail.com> on Aug/27th - it looks the same prob

Re: Parquet partitioning for unique identifier

2015-09-04 Thread Cheng Lian
(valueContainsNull = false) |-- imp2: map (nullable = true) ||-- key: string ||-- value: double (valueContainsNull = false) |-- imp3: map (nullable = true) ||-- key: string ||-- value: double (valueContainsNull = false) On Thu, Sep 3, 2015 at 11:27 PM, Cheng Lian <lian

Re: Parquet partitioning for unique identifier

2015-09-04 Thread Cheng Lian
Could you please provide the full stack trace of the OOM exception? Another common case of Parquet OOM is super wide tables, say hundreds or thousands of columns. And in this case, the number of rows is mostly irrelevant. Cheng On 9/4/15 1:24 AM, Kohki Nishio wrote: let's say I have a data

Re: Schema From parquet file

2015-09-01 Thread Cheng Lian
What exactly do you mean by "get schema from a parquet file"? - If you are trying to inspect Parquet files, parquet-tools can be pretty neat: https://github.com/Parquet/parquet-mr/issues/321 - If you are trying to get the Parquet schema as a Parquet MessageType, you may resort to readFooterX() and

Re: Group by specific key and save as parquet

2015-09-01 Thread Cheng Lian
Starting from Spark 1.4, you can do this via dynamic partitioning: sqlContext.table("trade").write.partitionBy("date").parquet("/tmp/path") Cheng On 9/1/15 8:27 AM, gtinside wrote: Hi , I have a set of data, I need to group by specific key and then save as parquet. Refer to the code snippet

Re: Spark 1.3.1 saveAsParquetFile hangs on app exit

2015-08-26 Thread Cheng Lian
Could you please show jstack result of the hanged process? Thanks! Cheng On 8/26/15 10:46 PM, cingram wrote: I have a simple test that is hanging when using s3a with spark 1.3.1. Is there something I need to do to cleanup the S3A file system? The write to S3 appears to have worked but this job

Re: How to overwrite partition when writing Parquet?

2015-08-20 Thread Cheng Lian
You can apply a filter first to keep only the data for the needed dates and then append it. Cheng On 8/20/15 4:59 PM, Hemant Bhanawat wrote: How can I overwrite only a given partition or manually remove a partition before writing? I don't know if (and I don't think) there is a way to do that
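A sketch of the filter-then-append idea from the reply above; the column name, the date literal, and the output path are illustrative, and newData stands for the DataFrame holding the refreshed data.

    // Keep only the rows belonging to the partition being refreshed, then append them.
    val fresh = newData.filter("date = '2015-08-20'")
    fresh.write.mode("append").partitionBy("date").parquet("/data/events")

    // Any stale files already under date=2015-08-20 still need to be removed by hand.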

Re: Very high latency to initialize a DataFrame from partitioned parquet database.

2015-08-12 Thread Cheng Lian
. And you can sort the records before writing them out, and then you will get parquet files without overlapping keys. Let us know if that helps. Hao *From:* Philip Weaver [mailto:philip.wea...@gmail.com] *Sent:* Wednesday, August 12, 2015 4:05 AM *To:* Cheng Lian *Cc:* user *Subject:* Re: Very high

Re: Parquet without hadoop: Possible?

2015-08-12 Thread Cheng Lian
One thing to note is that it would be good to add an explicit file system scheme to the output path (i.e. file:///var/... instead of /var/...), esp. when you do have HDFS running. Because in this case the data might be written to HDFS rather than your local file system if Spark found Hadoop
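A sketch making the target file system explicit, as suggested above; the paths are illustrative.

    // Explicit scheme: always the local file system, regardless of the default file system.
    df.write.parquet("file:///var/data/output.parquet")

    // No scheme: resolved against the default file system, which is HDFS when Hadoop is configured.
    df.write.parquet("/var/data/output.parquet")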

Re: Merge metadata error when appending to parquet table

2015-08-09 Thread Cheng Lian
The conflicting metadata values warning is a known issue (https://issues.apache.org/jira/browse/PARQUET-194). The option parquet.enable.summary-metadata is a Hadoop option rather than a Spark option, so you need to either add it to your Hadoop configuration file(s) or add it via

Re: Spark failed while trying to read parquet files

2015-08-07 Thread Cheng Lian
It doesn't seem to be Parquet 1.7.0 since the package name isn't under org.apache.parquet (1.7.0 is the first official Apache release of Parquet). The version you were using is probably Parquet 1.6.0rc3 according to the line number information:

Re: Very high latency to initialize a DataFrame from partitioned parquet database.

2015-08-07 Thread Cheng Lian
is mysteriously low... Cheng On 8/7/15 3:32 PM, Cheng Lian wrote: Hi Philip, Thanks for providing the log file. It seems that most of the time are spent on partition discovery. The code snippet you provided actually issues two jobs. The first one is for listing the input directories to find out all

Re: Very high latency to initialize a DataFrame from partitioned parquet database.

2015-08-07 Thread Cheng Lian
a DataFrame manually, and see if I can query it with Spark SQL with reasonable performance. - Philip On Thu, Aug 6, 2015 at 8:37 AM, Cheng Lian <lian.cs@gmail.com> wrote: Would you mind providing the driver log? On 8/6/15 3:58 PM, Philip Weaver wrote

Re: Very high latency to initialize a DataFrame from partitioned parquet database.

2015-08-06 Thread Cheng Lian
philip.wea...@gmail.com wrote: Absolutely, thanks! On Wed, Aug 5, 2015 at 9:07 PM, Cheng Lian <lian.cs@gmail.com> wrote: We've fixed this issue in 1.5: https://github.com/apache/spark/pull/7396 Could

Re: Very high latency to initialize a DataFrame from partitioned parquet database.

2015-08-05 Thread Cheng Lian
We've fixed this issue in 1.5: https://github.com/apache/spark/pull/7396 Could you give it a shot to see whether it helps in your case? We've observed a ~50x performance boost with schema merging turned on. Cheng On 8/6/15 8:26 AM, Philip Weaver wrote: I have a parquet directory that was

Re: Safe to write to parquet at the same time?

2015-08-04 Thread Cheng Lian
It should be safe for Spark 1.4.1 and later versions. Now Spark SQL adds a job-wise UUID to output file names to distinguish files written by different write jobs. So those two write jobs you gave should play well with each other. And the job committed later will generate a summary file for

Re: Parquet SaveMode.Append Trouble.

2015-08-04 Thread Cheng Lian
You need to import org.apache.spark.sql.SaveMode Cheng On 7/31/15 6:26 AM, satyajit vegesna wrote: Hi, I am new to using Spark and Parquet files. Below is what I am trying to do, on the Spark shell: val df = sqlContext.parquetFile("/data/LM/Parquet/Segment/pages/part-m-0.gz.parquet") Have
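A small usage sketch of the import mentioned above; the path is illustrative.

    import org.apache.spark.sql.SaveMode

    // Append to an existing Parquet dataset instead of failing when it already exists.
    df.write.mode(SaveMode.Append).parquet("/data/LM/Parquet/Segment/pages")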

Re: Unexpected performance issues with Spark SQL using Parquet

2015-07-27 Thread Cheng Lian
Hi Jerry, Thanks for the detailed report! I haven't investigate this issue in detail. But for the input size issue, I believe this is due to a limitation of HDFS API. It seems that Hadoop FileSystem adds the size of a whole block to the metrics even if you only touch a fraction of that

Re: Partition parquet data by ENUM column

2015-07-24 Thread Cheng Lian
: BINARY, OriginalType: ENUM) Valid types for this column are: null Is it because Spark does not recognize ENUM type in parquet? Best Regards, Jerry On Wed, Jul 22, 2015 at 12:21 AM, Cheng Lian <lian.cs@gmail.com> wrote: On 7/22/15 9:03 AM, Ankit wrote

Re: writing/reading multiple Parquet files: Failed to merge incompatible data types StringType and StructType

2015-07-24 Thread Cheng Lian
I don’t think this is a bug either. For an empty JSON array [], there’s simply no way to infer its actual data type, and in this case Spark SQL just tries to fill in the “safest” type, which is StringType, because basically you can cast any data type to StringType. In general, schema

Re: Partition parquet data by ENUM column

2015-07-24 Thread Cheng Lian
problem here is that Spark SQL can’t prevent pushing down a predicate over an ENUM field since it sees the field as a normal string field. Would you mind filing a JIRA ticket? Cheng On 7/24/15 2:14 PM, Cheng Lian wrote: Could you please provide the full stack trace of the exception? And what's

Re: Parquet problems

2015-07-22 Thread Cheng Lian
How many columns are there in these Parquet files? Could you load a small portion of the original large dataset successfully? Cheng On 6/25/15 5:52 PM, Anders Arpteg wrote: Yes, both the driver and the executors. Works a little bit better with more space, but still a leak that will cause

Re: Spark-hive parquet schema evolution

2015-07-22 Thread Cheng Lian
to manually alter the table every time the underlying schema changes? Thanks On Tue, Jul 21, 2015 at 4:37 PM, Cheng Lian <lian.cs@gmail.com> wrote: Hey Jerrick, What do you mean by schema evolution with Hive metastore tables? Hive

Re: Spark-hive parquet schema evolution

2015-07-22 Thread Cheng Lian
schema evolution. So what is the best way to support CLI queries in this situation? Do I need to manually alter the table every time the underlying schema changes? Thanks On Tue, Jul 21, 2015 at 4:37 PM, Cheng Lian <lian.cs@gmail.com> wrote: Hey Jerrick

Re: Partition parquet data by ENUM column

2015-07-21 Thread Cheng Lian
. Do you mean how to verify whether partition pruning is effective? You should be able to see log lines like this: 15/07/22 11:14:35 INFO DataSourceStrategy: Selected 1 partitions out of 3, pruned 66.67% partitions. On Tue, Jul 21, 2015 at 4:35 PM, Cheng Lian lian.cs

Re: Partition parquet data by ENUM column

2015-07-21 Thread Cheng Lian
Parquet support for Thrift/Avro/ProtoBuf ENUM types was just added to the master branch: https://github.com/apache/spark/pull/7048 ENUM types are actually not in the Parquet format spec, which is why we didn't have them in the first place. Basically, ENUMs are always treated as UTF8 strings in

Re: Spark-hive parquet schema evolution

2015-07-21 Thread Cheng Lian
Hey Jerrick, What do you mean by schema evolution with Hive metastore tables? Hive doesn't take schema evolution into account. Could you please give a concrete use case? Are you trying to write Parquet data with extra columns into an existing metastore Parquet table? Cheng On 7/21/15 1:04

Re: what is : ParquetFileReader: reading summary file ?

2015-07-17 Thread Cheng Lian
Yeah, Spark SQL Parquet support needs to do some metadata discovery when first importing a folder containing Parquet files, and the discovered metadata is cached. Cheng On 7/17/15 1:56 PM, shsh...@tsmc.com wrote: Hi all, our scenario is to generate lots of folders containing parquet files and

Re: DataFrame.write().partitionBy(some_column).parquet(path) produces OutOfMemory with very few items

2015-07-16 Thread Cheng Lian
Hi Nikos, How many columns and distinct values of some_column are there in the DataFrame? The Parquet writer is known to be very memory-consuming for wide tables. And lots of distinct partition column values result in many concurrent Parquet writers. One possible workaround is to first
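The snippet is cut off, but a commonly suggested mitigation for this pattern (an assumption here, not necessarily the workaround meant above) is to repartition by the partition column before writing, so each task holds open far fewer concurrent Parquet writers. Sketch assuming Spark 1.6+, where DataFrame.repartition accepts column expressions:

    import org.apache.spark.sql.functions.col

    // Rows sharing the same some_column value are grouped into the same task before the write.
    df.repartition(col("some_column"))
      .write
      .partitionBy("some_column")
      .parquet("/output/path")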

Re: How to disable parquet schema merging in 1.4?

2015-07-01 Thread Cheng Lian
With Spark 1.4, you may use the data source option mergeSchema to control it: sqlContext.read.option("mergeSchema", "false").parquet("some/path") or CREATE TABLE t USING parquet OPTIONS (mergeSchema "false", path "some/path") We're considering disabling schema merging by default in 1.5.0 since it

Re: Spark DataFrame 1.4 write to parquet/saveAsTable tasks fail

2015-06-17 Thread Cheng Lian
What's the size of this table? Is the data skewed (so that speculation is probably triggered)? Cheng On 6/15/15 10:37 PM, Night Wolf wrote: Hey Yin, Thanks for the link to the JIRA. I'll add details to it. But I'm able to reproduce it, at least in the same shell session, every time I do a

Re: BigDecimal problem in parquet file

2015-06-17 Thread Cheng Lian
it properly. Thanks for helping out. Bipin On 12 June 2015 at 14:57, Cheng Lian <lian.cs@gmail.com> wrote: On 6/10/15 8:53 PM, Bipin Nag wrote: Hi Cheng, I am using Spark 1.3.1 binary available for Hadoop 2.6

Re: HiveContext saveAsTable create wrong partition

2015-06-17 Thread Cheng Lian
Thanks for reporting this. Would you mind to help creating a JIRA for this? On 6/16/15 2:25 AM, patcharee wrote: I found if I move the partitioned columns in schemaString and in Row to the end of the sequence, then it works correctly... On 16. juni 2015 11:14, patcharee wrote: Hi, I am

Re: Spark 1.4 DataFrame Parquet file writing - missing random rows/partitions

2015-06-17 Thread Cheng Lian
Hi Nathan, Thanks a lot for the detailed report, especially the information about nonconsecutive part numbers. It's confirmed to be a race condition bug and just filed https://issues.apache.org/jira/browse/SPARK-8406 to track this. Will deliver a fix ASAP and this will be included in 1.4.1.

Re: Spark 1.4 DataFrame Parquet file writing - missing random rows/partitions

2015-06-17 Thread Cheng Lian
find! Let me know if there is anything we can do to help on this end with contributing a fix or testing. Side note - any ideas on the 1.4.1 eta? There are a few bug fixes we need in there. Cheers, Nathan From: Cheng Lian Date: Wednesday, 17 June 2015 6:25 pm To: Nathan, user@spark.apache.org

Re: Dataframe Write : Tables created with SQLContext must be TEMPORARY. Use a HiveContext instead.

2015-06-13 Thread Cheng Lian
As the error message says, were you using a SQLContext instead of a HiveContext to create the DataFrame? In Spark shell, although the variable name is sqlContext, the type of that variable is actually org.apache.spark.sql.hive.HiveContext, which has the ability to communicate with

Re: Re: Re: Re: Re: Re: Re: Re: Re: Met OOM when fetching more than 1,000,000 rows.

2015-06-12 Thread Cheng Lian
have ever seen the OOM stderr log on the slave node. But recently there seems to be no OOM log on the slave node. Following the cmd, data, env and the code I gave you, the OOM can be 100% reproduced in cluster mode. Thanks, SuperJ - Original Message - *From:* Cheng Lian l...@databricks.com *To:* 姜超才 jiangchao

Re: Re: Re: Re: Re: Re: Re: Re: Re: Met OOM when fetching more than 1,000,000 rows.

2015-06-12 Thread Cheng Lian
is not performed). Cheng On 6/12/15 4:17 PM, Cheng, Hao wrote: Not sure if Spark Core will provide an API to fetch the records one by one from the block manager, instead of pulling them all into the driver memory. *From:* Cheng Lian [mailto:l...@databricks.com] *Sent:* Friday, June 12, 2015 3:51 PM
