Re: Sample Project for using Shark API in Spark programs

2014-04-07 Thread Jerry Lam
Hi Shark, Should I assume that Shark users should not use the Shark APIs since there is no documentation for them? If there is documentation, can you point it out? Best Regards, Jerry On Thu, Apr 3, 2014 at 9:24 PM, Jerry Lam chiling...@gmail.com wrote: Hello everyone, I have

Re: hbase scan performance

2014-04-09 Thread Jerry Lam
Hi Dave, This is the HBase solution to the poor scan performance issue: https://issues.apache.org/jira/browse/HBASE-8369 I encountered the same issue before. To the best of my knowledge, this is not a mapreduce issue. It is an HBase issue. If you are planning to swap out mapreduce and replace it with

Spark Summit 2014 (Hotel suggestions)

2014-05-06 Thread Jerry Lam
Hi Spark users, Do you guys plan to go to the Spark Summit? Can you recommend any hotel near the conference? I'm not familiar with the area. Thanks! Jerry

Re: Spark Summit 2014 (Hotel suggestions)

2014-05-27 Thread Jerry Lam
Hi guys, I ended up reserving a room at the Phoenix (Hotel: http://www.jdvhotels.com/hotels/california/san-francisco-hotels/phoenix-hotel) recommended by my friend who has been in SF. According to Google, it takes 11min to walk to the conference which is not too bad. Hope this helps! Jerry

Re: java.lang.OutOfMemoryError (java.lang.OutOfMemoryError: GC overhead limit exceeded)

2014-07-08 Thread Jerry Lam
Hi Konstantin, I just ran into the same problem. I mitigated the issue by reducing the number of cores when I executed the job, which otherwise would not be able to finish. Contrary to what many people believe, it might not mean that you were running out of memory. A better answer can be found here:
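For context, "GC overhead limit exceeded" means the JVM is spending almost all of its time collecting garbage while reclaiming almost nothing; it does not necessarily mean the heap is too small for the data. A minimal sketch of the mitigation described above, assuming a YARN deployment (property names per the 1.x docs; values are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    // Fewer cores per executor means fewer concurrent tasks sharing one
    // executor heap, so each task effectively gets more memory.
    val conf = new SparkConf()
      .set("spark.executor.cores", "2")    // e.g. down from 8
      .set("spark.executor.memory", "8g")  // heap unchanged
    val sc = new SparkContext(conf)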

Re: Purpose of spark-submit?

2014-07-09 Thread Jerry Lam
+1 as well for being able to submit jobs programmatically without using a shell script. We also experienced issues submitting jobs programmatically without using spark-submit. In fact, even in the Hadoop world, I rarely used hadoop jar to submit jobs in a shell. On Wed, Jul 9, 2014 at 9:47 AM,

Re: Purpose of spark-submit?

2014-07-09 Thread Jerry Lam
that defines how my application should look. In my humble opinion, using Spark as an embeddable library rather than the main framework and runtime is much easier. On Wed, Jul 9, 2014 at 5:14 PM, Jerry Lam chiling...@gmail.com wrote: +1 as well for being able to submit jobs programmatically without

Potential bugs in SparkSQL

2014-07-10 Thread Jerry Lam
Hi Spark developers, I have the following HQLs for which Spark throws exceptions of this kind: 14/07/10 15:07:55 INFO TaskSetManager: Loss was due to org.apache.spark.TaskKilledException [duplicate 17] org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:736 failed 4 times,

Using HQL is terribly slow: Potential Performance Issue

2014-07-10 Thread Jerry Lam
Hi Spark users and developers, I'm doing some simple benchmarks with my team and we found out a potential performance issue using Hive via SparkSQL. It is very bothersome, so your help in understanding why it is terribly slow would be very much appreciated. First, we have some text files in HDFS which

Re: Using HQL is terribly slow: Potential Performance Issue

2014-07-10 Thread Jerry Lam
By the way, I also tried hql(select * from m).count. It is terribly slow too. On Thu, Jul 10, 2014 at 5:08 PM, Jerry Lam chiling...@gmail.com wrote: Hi Spark users and developers, I'm doing some simple benchmarks with my team and we found out a potential performance issue using Hive via

Re: Using HQL is terribly slow: Potential Performance Issue

2014-07-10 Thread Jerry Lam
Hi Spark users, Also, to put the performance issue into perspective, we also ran the query on Hive. It took about 5 minutes to run. Best Regards, Jerry On Thu, Jul 10, 2014 at 5:10 PM, Jerry Lam chiling...@gmail.com wrote: By the way, I also tried hql(select * from m).count. It is terribly

Re: Using HQL is terribly slow: Potential Performance Issue

2014-07-10 Thread Jerry Lam
overhead, then there must be something additional that SparkSQL adds to the overall overheads that Hive doesn't have. Best Regards, Jerry On Thu, Jul 10, 2014 at 7:11 PM, Michael Armbrust mich...@databricks.com wrote: On Thu, Jul 10, 2014 at 2:08 PM, Jerry Lam chiling...@gmail.com wrote

Re: Potential bugs in SparkSQL

2014-07-10 Thread Jerry Lam
provide the output of the following command: println(hql(select s.id from m join s on (s.id=m_id)).queryExecution) Michael On Thu, Jul 10, 2014 at 8:15 AM, Jerry Lam chiling...@gmail.com wrote: Hi Spark developers, I have the following HQLs for which Spark throws exceptions of this kind: 14

Re: Can we get a spark context inside a mapper

2014-07-14 Thread Jerry Lam
Hi there, I think the question is interesting; a spark of sparks = spark I wonder if you can use the spark job server ( https://github.com/ooyala/spark-jobserver)? So in the spark task that requires a new spark context, instead of creating it in the task, contact the job server to create one and

Re: How to kill running spark yarn application

2014-07-14 Thread Jerry Lam
Then yarn application -kill appid should work. This is what I did 2 hours ago. Sorry I cannot provide more help. Sent from my iPhone On 14 Jul, 2014, at 6:05 pm, hsy...@gmail.com hsy...@gmail.com wrote: yarn-cluster On Mon, Jul 14, 2014 at 2:44 PM, Jerry Lam chiling...@gmail.com wrote

Re: Need help on spark Hbase

2014-07-15 Thread Jerry Lam
Hi Rajesh, can you describe your spark cluster setup? I saw localhost:2181 for zookeeper. Best Regards, Jerry On Tue, Jul 15, 2014 at 9:47 AM, Madabhattula Rajesh Kumar mrajaf...@gmail.com wrote: Hi Team, Could you please help me to resolve the issue. *Issue *: I'm not able to connect

Re: Need help on spark Hbase

2014-07-15 Thread Jerry Lam
Hi Rajesh, I have a feeling that this is not directly related to spark but I might be wrong. The reason why is that when you do: Configuration configuration = HBaseConfiguration.create(); by default, it reads the configuration file hbase-site.xml in your classpath and ... (I don't remember
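A minimal sketch of what the reply describes, assuming the HBase client jars are on the classpath (the quorum address is illustrative and would normally come from hbase-site.xml):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.hbase.HBaseConfiguration

    // HBaseConfiguration.create() loads hbase-default.xml and hbase-site.xml
    // from the classpath; without an hbase-site.xml, zookeeper falls back to
    // localhost:2181, which matches the symptom in this thread.
    val configuration: Configuration = HBaseConfiguration.create()

    // Explicit override, in case the file is not on the executor classpath.
    configuration.set("hbase.zookeeper.quorum", "zk1.example.com,zk2.example.com")
    configuration.set("hbase.zookeeper.property.clientPort", "2181")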

Re: Repeated data item search with Spark SQL(1.0.1)

2014-07-15 Thread Jerry Lam
Hi guys, Sorry, I'm also interested in this nested json structure. I have a similar SQL in which I need to query a nested field in a json. Does the above query work if it is used with sql(sqlText), assuming the data is coming directly from hdfs via sqlContext.jsonFile? The SPARK-2483

Re: Need help on spark Hbase

2014-07-16 Thread Jerry Lam
, stacktraces, exceptions, etc. TD On Tue, Jul 15, 2014 at 10:07 AM, Jerry Lam chiling...@gmail.com wrote: Hi Rajesh, I have a feeling that this is not directly related to spark but I might be wrong. The reason why is that when you do: Configuration configuration

Filtering nested data using Spark SQL

2014-12-10 Thread Jerry Lam
Hi spark users, I'm trying to filter a json file that has the following schema using Spark SQL: root |-- user_id: string (nullable = true) |-- item: array (nullable = true) |    |-- element: struct (containsNull = false) |    |    |-- item_id: string (nullable = true) |    |    |-- name:
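A sketch of one way to express such a filter in the Spark 1.2-era API (paths, names and the predicate are illustrative; sc is the usual shell context):

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    val user = sqlContext.jsonFile("hdfs:///path/to/users.json")
    user.registerTempTable("user")

    // Dot/array syntax reaches into the nested struct, but item[0] only
    // tests the first element; matching *any* element generally needs
    // LATERAL VIEW explode (HiveContext) or a row-level filter on the RDD.
    val firstItemPhones =
      sqlContext.sql("SELECT user_id FROM user WHERE item[0].name = 'phone'")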

Accessing rows of a row in Spark

2014-12-15 Thread Jerry Lam
Hi spark users, Do you know how to access rows of row? I have a SchemaRDD called user and registered it as a table with the following schema: root |-- user_id: string (nullable = true) |-- item: array (nullable = true) |    |-- element: struct (containsNull = false) |    |    |-- item_id:

Re: Accessing rows of a row in Spark

2014-12-15 Thread Jerry Lam
== 1 } res0: Int = 1 ...else: scala> items.count { case (user_id, name) => user_id == 1 } res1: Int = 1 On Mon, Dec 15, 2014 at 11:04 AM, Jerry Lam chiling...@gmail.com wrote: Hi spark users, Do you know how to access rows of row? I have a SchemaRDD called user and register
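A fuller sketch of the approach the fragment above hints at, assuming the user SchemaRDD and schema from the original question (field positions follow that schema; the predicate is illustrative):

    import org.apache.spark.sql.Row

    // Each top-level Row carries the nested array as a Seq[Row]; pull it
    // out by position and read the inner rows' fields the same way.
    val counts = user.map { row =>
      val items = row(1).asInstanceOf[Seq[Row]] // index 1 = the "item" array
      items.count(item => item.getString(1) == "phone") // index 1 = "name"
    }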

Spark SQL 1.1.1 reading LZO compressed json files

2014-12-17 Thread Jerry Lam
Hi spark users, Do you know how to read json files using Spark SQL that are LZO compressed? I'm looking into sqlContext.jsonFile but I don't know how to configure it to read lzo files. Best Regards, Jerry
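A sketch of the workaround that surfaces later in the thread: read the LZO files as text through the hadoop-lzo input format, then hand the decompressed lines to Spark SQL (assumes the hadoop-lzo jar and native libraries are installed; the path is illustrative):

    import org.apache.hadoop.io.{LongWritable, Text}
    import com.hadoop.mapreduce.LzoTextInputFormat

    // sqlContext.jsonFile offers no hook for a custom input format,
    // but newAPIHadoopFile does.
    val lines = sc.newAPIHadoopFile[LongWritable, Text, LzoTextInputFormat](
      "hdfs:///path/to/json.lzo").map(_._2.toString)

    // jsonRDD infers the schema from the decompressed JSON lines.
    val table = sqlContext.jsonRDD(lines)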

Re: Spark SQL 1.1.1 reading LZO compressed json files

2014-12-17 Thread Jerry Lam
, Dec 17, 2014 at 11:27 AM, Ted Yu yuzhih...@gmail.com wrote: See this thread: http://search-hadoop.com/m/JW1q5HAuFv which references https://issues.apache.org/jira/browse/SPARK-2394 Cheers On Wed, Dec 17, 2014 at 8:21 AM, Jerry Lam chiling...@gmail.com wrote: Hi spark users, Do you know

Re: Spark SQL 1.1.1 reading LZO compressed json files

2014-12-17 Thread Jerry Lam
at 8:33 AM, Jerry Lam chiling...@gmail.com wrote: Hi Ted, Thanks for your help. I'm able to read lzo files using sparkContext.newAPIHadoopFile but I couldn't do the same for sqlContext because sqlContext.jsonFile does not provide ways to configure the input file format. Do you know

UNION two RDDs

2014-12-18 Thread Jerry Lam
Hi Spark users, I wonder if val resultRDD = RDDA.union(RDDB) will always have records in RDDA before records in RDDB. Also, will resultRDD.coalesce(1) change this ordering? Best Regards, Jerry

Re: UNION two RDDs

2014-12-22 Thread Jerry Lam
AFAIK. On Fri, Dec 19, 2014 at 2:22 AM, Jerry Lam chiling...@gmail.com wrote: Hi Spark users, I wonder if val resultRDD = RDDA.union(RDDB) will always have records in RDDA before records in RDDB. Also, will resultRDD.coalesce(1) change this ordering? Best Regards, Jerry

SparkSQL: CREATE EXTERNAL TABLE with a SchemaRDD

2014-12-23 Thread Jerry Lam
Hi spark users, I'm trying to create external table using HiveContext after creating a schemaRDD and saving the RDD into a parquet file on hdfs. I would like to use the schema in the schemaRDD (rdd_table) when I create the external table. For example:
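A sketch of the intent under the Spark 1.2-era API, assuming rdd_table is the SchemaRDD from the question (the column list and paths are illustrative; deriving that column list from rdd_table's schema automatically is exactly what the thread is asking about):

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)

    // Save the SchemaRDD as parquet, then point an external table at it.
    rdd_table.saveAsParquetFile("hdfs:///data/rdd_table")

    hiveContext.sql("""
      CREATE EXTERNAL TABLE rdd_table (user_id STRING, item STRING)
      STORED AS PARQUET
      LOCATION 'hdfs:///data/rdd_table'
    """)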

Spark or Tachyon: capture data lineage

2015-01-02 Thread Jerry Lam
Hi spark developers, I was thinking it would be nice to extract the data lineage information from a data processing pipeline. I assume that spark/tachyon keeps this information somewhere. For instance, a data processing pipeline uses datasource A and B to produce C. C is then used by another

Re: Reading from CSV file with spark-csv_2.10

2015-02-05 Thread Jerry Lam
Hi Florin, I might be wrong, but timestamp looks like a keyword in SQL that the engine gets confused with. If it is a column name in your table, you might want to change it. ( https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types) I'm constantly working with CSV files with spark.
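If renaming the column is not an option, escaping it may also work, depending on the SQL dialect in use; a sketch with HiveQL-style backticks (the other column names are illustrative):

    // Backticks make `timestamp` a plain identifier instead of a keyword.
    sqlContext.sql("SELECT `timestamp`, open, close FROM quotes")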

Re: Why Parquet Predicate Pushdown doesn't work?

2015-01-19 Thread Jerry Lam
Hi guys, Does this issue affect 1.2.0 only or all previous releases as well? Best Regards, Jerry On Thu, Jan 8, 2015 at 1:40 AM, Xuelin Cao xuelincao2...@gmail.com wrote: Yes, the problem is, I've turned the flag on. One possible reason for this is, the parquet file supports predicate

Re: Spark Team - Paco Nathan said that your team can help

2015-01-22 Thread Jerry Lam
Hi Sudipta, I would also like to suggest asking this question on the Cloudera mailing list since you have HDFS, MapReduce and YARN requirements. Spark can work with HDFS and YARN, but it is more like a client to those clusters. Cloudera can provide services to answer your question more clearly. I'm

Re: Union in Spark

2015-02-01 Thread Jerry Lam
Hi Deep, what do you mean by stuck? Jerry On Mon, Feb 2, 2015 at 12:44 AM, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, Is there any better operation than Union. I am using union and the cluster is getting stuck with a large data set. Thank you

Re: Union in Spark

2015-02-01 Thread Jerry Lam
Hi Deep, How do you know the cluster is not responsive because of Union? Did you check the spark web console? Best Regards, Jerry On Mon, Feb 2, 2015 at 1:21 AM, Deep Pradhan pradhandeep1...@gmail.com wrote: The cluster hangs. On Mon, Feb 2, 2015 at 11:25 AM, Jerry Lam chiling

Re: Spark SQL: Storing AVRO Schema in Parquet

2015-01-09 Thread Jerry Lam
. There is parquet-mr project that uses hadoop to do so. I am trying to write a spark job to do similar kind of thing. On Fri, Jan 9, 2015 at 3:20 AM, Jerry Lam chiling...@gmail.com wrote: Hi spark users, I'm using spark SQL to create parquet files on HDFS. I would like to store the avro schema

Re: IndexedRDD

2015-01-13 Thread Jerry Lam
Hi guys, I'm interested in the IndexedRDD too. How many rows in the big table match the small table in every run? If the number of rows stays constant, then I think Jem wants the runtime to stay about constant (i.e. ~0.6 seconds for all cases). However, I agree with Andrew. The performance

Spark SQL: Storing AVRO Schema in Parquet

2015-01-08 Thread Jerry Lam
Hi spark users, I'm using spark SQL to create parquet files on HDFS. I would like to store the avro schema in the parquet metadata so that non-Spark-SQL applications can marshall the data without the avro schema using the avro parquet reader. Currently, schemaRDD.saveAsParquetFile does not allow one to do

Re: Benchmark results between Flink and Spark

2015-07-05 Thread Jerry Lam
Hi guys, I just read the paper too. There is not much information regarding why Flink is faster than Spark for data science type of workloads in the benchmark. It is very difficult to generalize the conclusion of a benchmark, from my point of view. How much experience the author has with Spark is

Re: Spark + Jupyter (IPython Notebook)

2015-08-18 Thread Jerry Lam
using this for one of my projects on a cluster as well. Also, here is a blog that describes how to configure this. http://blog.cloudera.com/blog/2014/08/how-to-use-ipython-notebook-with-apache-spark/ Guru Medasani gdm...@gmail.com On Aug 18, 2015, at 8:35 AM, Jerry Lam chiling...@gmail.com

Re: Spark + Jupyter (IPython Notebook)

2015-08-18 Thread Jerry Lam
Hi Prabeesh, That's even better! Thanks for sharing Jerry On Tue, Aug 18, 2015 at 1:31 PM, Prabeesh K. prabsma...@gmail.com wrote: Refer this post http://blog.prabeeshk.com/blog/2015/06/19/pyspark-notebook-with-docker/ Spark + Jupyter + Docker On 18 August 2015 at 21:29, Jerry Lam

Spark + Jupyter (IPython Notebook)

2015-08-18 Thread Jerry Lam
Hi spark users and developers, Did anyone have IPython Notebook (Jupyter) deployed in production that uses Spark as the computational engine? I know Databricks Cloud provides similar features with deeper integration with Spark. However, Databricks Cloud has to be hosted by Databricks so we

Re: [survey] [spark-ec2] What do you like/dislike about spark-ec2?

2015-08-17 Thread Jerry Lam
Hi Nick, I forgot to mention in the survey that ganglia is never installed properly for some reasons. I have this exception every time I launched the cluster: Starting httpd: httpd: Syntax error on line 154 of /etc/httpd/conf/httpd.conf: Cannot load /etc/httpd/modules/mod_authz_core.so into

Spark Master Build Git Commit Hash

2015-07-30 Thread Jerry Lam
Hi Spark users and developers, I wonder which git commit was used to build the latest master-nightly build found at: http://people.apache.org/~pwendell/spark-nightly/spark-master-bin/latest/? I downloaded the build but I couldn't find the information related to it. Thank you! Best Regards,

Re: Spark Master Build Git Commit Hash

2015-07-30 Thread Jerry Lam
for the commits made on Jul 16th. There may be other ways of determining the latest commit. Cheers On Thu, Jul 30, 2015 at 7:39 AM, Jerry Lam chiling...@gmail.com wrote: Hi Spark users and developers, I wonder which git commit was used to build the latest master-nightly build found at: http

Re: Controlling number of executors on Mesos vs YARN

2015-08-11 Thread Jerry Lam
My experience with Mesos + Spark is not great. I saw one executor with 30 CPUs and the other executor with 6. So I don't think you can easily configure it without some tweaking of the source code. Sent from my iPad On 2015-08-11, at 2:38, Haripriya Ayyalasomayajula aharipriy...@gmail.com

Re: Parquet without hadoop: Possible?

2015-08-11 Thread Jerry Lam
Just out of curiosity, what is the advantage of using parquet without hadoop? Sent from my iPhone On 11 Aug, 2015, at 11:12 am, saif.a.ell...@wellsfargo.com wrote: I confirm that it works, I was just having this issue: https://issues.apache.org/jira/browse/SPARK-8450 Saif From:

Poor HDFS Data Locality on Spark-EC2

2015-08-04 Thread Jerry Lam
Hi Spark users and developers, I have been trying to use spark-ec2. After I launched the spark cluster (1.4.1) with ephemeral hdfs (using hadoop 2.4.0), I tried to execute a job where the data is stored in the ephemeral hdfs. It does not matter what I tried to do, there is no data locality at

Re: Accessing S3 files with s3n://

2015-08-09 Thread Jerry Lam
Hi Akshat, Is there a particular reason you don't use s3a? From my experience, s3a performs much better than the rest. I believe the inefficiency is from the implementation of the s3 interface. Best Regards, Jerry Sent from my iPhone On 9 Aug, 2015, at 5:48 am, Akhil Das

Re: Controlling number of executors on Mesos vs YARN

2015-08-12 Thread Jerry Lam
as an example framework for Mesos - that's how I know it. It is surprising to see that the options provided by Mesos in this case are fewer. Tweaking the source code: I haven't done it yet, but I would love to see what options could be there! On Tue, Aug 11, 2015 at 5:42 AM, Jerry Lam chiling

Unexpected performance issues with Spark SQL using Parquet

2015-07-27 Thread Jerry Lam
Hi spark users and developers, I have been trying to understand how Spark SQL works with Parquet for a couple of days. There is an unexpected performance problem with column pruning. Here is a dummy example: The parquet file has the 3 fields: |-- customer_id: string (nullable =
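For reference, the shape of query that is expected to trigger column pruning on parquet; a sketch with the field from the schema above (Spark 1.4-style reader; the path is illustrative):

    // Selecting a single column should let the parquet reader skip the
    // other column chunks entirely -- the thread reports this not paying
    // off as expected.
    val ids = sqlContext.read.parquet("hdfs:///path/to/customers.parquet")
      .select("customer_id")
    ids.count()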

Re: Parquet problems

2015-07-22 Thread Jerry Lam
Hi guys, I noticed that too. Anders, can you confirm that it works on Spark 1.5 snapshot? This is what I tried at the end. It seems it is 1.4 issue. Best Regards, Jerry On Wed, Jul 22, 2015 at 11:46 AM, Anders Arpteg arp...@spotify.com wrote: No, never really resolved the problem, except by

Re: Benchmark results between Flink and Spark

2015-07-14 Thread Jerry Lam
similar style off-heap memory mgmt, more planning optimizations *From:* Jerry Lam [mailto:chiling...@gmail.com] *Sent:* Sunday, July 5, 2015 6:28 PM *To:* Ted Yu *Cc:* Slim Baltagi; user *Subject:* Re: Benchmark results between Flink and Spark Hi guys, I just read

Re: Counting distinct values for a key?

2015-07-19 Thread Jerry Lam
You mean this does not work? SELECT key, count(value) from table group by key On Sun, Jul 19, 2015 at 2:28 PM, N B nb.nos...@gmail.com wrote: Hello, How do I go about performing the equivalent of the following SQL clause in Spark Streaming? I will be using this on a Windowed DStream.
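And the distinct variant, for completeness (a sketch; the table and column names are illustrative):

    // COUNT(value) counts all non-null values per key; DISTINCT is what
    // makes it count unique values per key.
    sqlContext.sql("SELECT key, COUNT(DISTINCT value) FROM logs GROUP BY key")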

Re: Spark Mesos Dispatcher

2015-07-19 Thread Jerry Lam
Yes. Sent from my iPhone On 19 Jul, 2015, at 10:52 pm, Jahagirdar, Madhu madhu.jahagir...@philips.com wrote: All, Can we run different version of Spark using the same Mesos Dispatcher. For example we can run drivers with Spark 1.3 and Spark 1.4 at the same time ? Regards, Madhu

Re: Counting distinct values for a key?

2015-07-19 Thread Jerry Lam
= rdd.reduceByKey((c1, c2) -> c1 + c2); List<Tuple2<String, Integer>> output = rdd2.collect(); for (Tuple2<?,?> tuple : output) { System.out.println(tuple._1() + " : " + tuple._2()); } } On Sun, Jul 19, 2015 at 2:28 PM, Jerry Lam chiling...@gmail.com wrote: You mean

Re: Spark Mesos Dispatcher

2015-07-19 Thread Jerry Lam
? -- *From:* Jerry Lam [chiling...@gmail.com] *Sent:* Monday, July 20, 2015 8:27 AM *To:* Jahagirdar, Madhu *Cc:* user; d...@spark.apache.org *Subject:* Re: Spark Mesos Dispatcher Yes. Sent from my iPhone On 19 Jul, 2015, at 10:52 pm, Jahagirdar, Madhu madhu.jahagir...@philips.com wrote

Re: [Spark SQL]: Spark Job Hangs on the refresh method when saving over 1 million files

2015-10-25 Thread Jerry Lam
rtitions might be using tons of driver memory via the OutputCommitCoordinator's bookkeeping data structures. On Sun, Oct 25, 2015 at 5:50 PM, Jerry Lam <chiling...@gmail.com> wrote: Hi spark guys, I think I hit the same issue SPARK-8890 https:/

Re: [Spark SQL]: Spark Job Hangs on the refresh method when saving over 1 million files

2015-10-25 Thread Jerry Lam
) org.apache.spark.sql.execution.datasources.LogicalRelation.<init>(LogicalRelation.scala:31) org.apache.spark.sql.SQLContext.baseRelationToDataFrame(SQLContext.scala:395) org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:267) On Sun, Oct 25, 2015 at 10:25 PM, Jerry Lam <chiling...@gmail.com> wrote: Hi Josh,

Re: [Spark SQL]: Spark Job Hangs on the refresh method when saving over 1 million files

2015-10-25 Thread Jerry Lam
parameters to make it more memory efficient? Best Regards, Jerry On Sun, Oct 25, 2015 at 8:39 PM, Jerry Lam <chiling...@gmail.com> wrote: Hi guys, After waiting for a day, it actually causes OOM on the spark driver. I configure the driver to have 6GB. Note that I didn't c

Re: [Spark SQL]: Spark Job Hangs on the refresh method when saving over 1 million files

2015-10-25 Thread Jerry Lam
million files. Not sure why it OOMs the driver after the job is marked _SUCCESS in the output folder. Best Regards, Jerry On Sat, Oct 24, 2015 at 9:35 PM, Jerry Lam <chiling...@gmail.com> wrote: Hi Spark users and developers, Does anyone encounter any issue when a spark SQL job

[Spark-SQL]: Unable to propagate hadoop configuration after SparkContext is initialized

2015-10-27 Thread Jerry Lam
Hi Spark users and developers, Anyone experiences issues in setting hadoop configurations after SparkContext is initialized? I'm using Spark 1.5.1. I'm trying to use s3a which requires access and secret key set into hadoop configuration. I tried to set the properties in the hadoop configuration
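A sketch of the two approaches discussed downthread (the s3a key names are the standard ones; the values are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    // Option 1: before the context exists, any "spark.hadoop."-prefixed
    // key is copied into the Hadoop configuration used by jobs.
    val conf = new SparkConf()
      .set("spark.hadoop.fs.s3a.access.key", "ACCESS_KEY")
      .set("spark.hadoop.fs.s3a.secret.key", "SECRET_KEY")
    val sc = new SparkContext(conf)

    // Option 2: mutate it after creation -- this is the route the thread
    // found unreliable for some Spark SQL code paths.
    sc.hadoopConfiguration.set("fs.s3a.access.key", "ACCESS_KEY")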

Re: [Spark-SQL]: Unable to propagate hadoop configuration after SparkContext is initialized

2015-10-27 Thread Jerry Lam
Marcelo Vanzin <van...@cloudera.com> wrote: On Tue, Oct 27, 2015 at 10:43 AM, Jerry Lam <chiling...@gmail.com> wrote: Anyone experiences issues in setting hadoop configurations after SparkContext is initialized? I'm using Spark 1.5.1. I'm

Re: [Spark-SQL]: Unable to propagate hadoop configuration after SparkContext is initialized

2015-10-27 Thread Jerry Lam
r with that code. On Tue, Oct 27, 2015 at 11:22 AM, Jerry Lam <chiling...@gmail.com> wrote: Hi Marcelo, Thanks for the advice. I understand that we could set the configurations before creating SparkContext. My question is SparkCon

Re: Spark -- Writing to Partitioned Persistent Table

2015-10-28 Thread Jerry Lam
Hi Bryan, Did you read the email I sent a few days ago? There are more issues with partitionBy down the road: https://www.mail-archive.com/user@spark.apache.org/msg39512.html Best Regards, Jerry On Oct 28, 2015, at 4:52 PM,

Re: [Spark SQL]: Spark Job Hangs on the refresh method when saving over 1 million files

2015-10-26 Thread Jerry Lam
nterfaces.scala:561) org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:560) org.apache.spark.sql.execution.datasources.LogicalRelation.<init>(LogicalRelation.scala:31) org.apache.spark.sql.SQLContext.baseRelationToDataFrame(SQLContext.scala:395) org.apache.spark.sql.DataFrameRea

Spark SQL: Issues with using DirectParquetOutputCommitter with APPEND mode and OVERWRITE mode

2015-10-22 Thread Jerry Lam
Hi Spark users and developers, I read the ticket [SPARK-8578] (Should ignore user defined output committer when appending data) which ignore DirectParquetOutputCommitter if append mode is selected. The logic was that it is unsafe to use because it is not possible to revert a failed job in append
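For context, the setting under discussion, as a sketch (the conf key is the documented one; the committer's package moved between 1.x releases, so the class name below is a 1.5-era assumption, and the path is illustrative):

    // SPARK-8578: with SaveMode.Append, Spark silently ignores this and
    // falls back to the default output committer.
    sqlContext.setConf("spark.sql.parquet.output.committer.class",
      "org.apache.spark.sql.execution.datasources.parquet.DirectParquetOutputCommitter")

    df.write.mode("overwrite").parquet("s3a://bucket/table")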

Re: Spark -- Writing to Partitioned Persistent Table

2015-10-28 Thread Jerry Lam
Jerry, Thank you for the note. It sounds like you were able to get further than I have been - any insight? Just a Spark 1.4.1 vs Spark 1.5? Regards, Bryan Jeffrey From: Jerry Lam Sent: 10/28/2015 6:29 PM To: Bryan Jeffrey Cc: S

Re: Very slow startup for jobs containing millions of tasks

2015-11-14 Thread Jerry Lam
an 1.5.0, you miss some fixes such as SPARK-9952 Cheers On Sat, Nov 14, 2015 at 6:35 PM, Jerry Lam <chiling...@gmail.com> wrote: Hi spark users and developers, Has anyone experienced the slow startup of a job when it contains a stage

Re: spark sql partitioned by date... read last date

2015-11-01 Thread Jerry Lam
r. the max-date is likely to be faster though. On Sun, Nov 1, 2015 at 4:36 PM, Jerry Lam <chiling...@gmail.com> wrote: Hi Koert, You should be able to see if it requires scanning the whole data by "explain" the query. The physica

Re: spark sql partitioned by date... read last date

2015-11-01 Thread Jerry Lam
xposed. On Sun, Nov 1, 2015 at 4:08 PM, Koert Kuipers <ko...@tresata.com> wrote: it seems to work but i am not sure if it's not scanning the whole dataset. let me dig into tasks a bit On Sun, Nov 1, 2015 at 3:18 PM, Jerry Lam <chili

Re: spark sql partitioned by date... read last date

2015-11-01 Thread Jerry Lam
of the physical plan, you can navigate the actual execution in the web UI to see how much data is actually read to satisfy this request. I hope it only requires a few bytes for a few dates. Best Regards, Jerry On Sun, Nov 1, 2015 at 5:56 PM, Jerry Lam <chiling...@gmail.com> wrote: I agreed the

Re: spark sql partitioned by date... read last date

2015-11-01 Thread Jerry Lam
Hi Koert, If the partitioned table is implemented properly, I would think "select distinct(date) as dt from table order by dt DESC limit 1" would return the latest date without scanning the whole dataset. I haven't tried it myself. It would be great if you can report back whether this actually

Re: [Spark-SQL]: Disable HiveContext from instantiating in spark-shell

2015-11-06 Thread Jerry Lam
park zzhang$ more conf/hive-site.xml hive.metastore.uris thrift://zzhang-yarn11:9083 HW11188:spark zzhang$ By the way, I don’t know whether there is any caveat for this workaround.

Re: [Spark-SQL]: Disable HiveContext from instantiating in spark-shell

2015-11-06 Thread Jerry Lam
t that cannot be done by HiveContext? Thanks. Zhan Zhang On Nov 6, 2015, at 10:43 AM, Jerry Lam <chiling...@gmail.com> wrote: What is interesting is that pyspark shell works fine with multiple session

Re: Please reply if you use Mesos fine grained mode

2015-11-03 Thread Jerry Lam
We "used" Spark on Mesos to build interactive data analysis platform because the interactive session could be long and might not use Spark for the entire session. It is very wasteful of resources if we used the coarse-grained mode because it keeps resource for the entire session. Therefore,

Re: Spark EC2 script on Large clusters

2015-11-05 Thread Jerry Lam
Does Qubole use Yarn or Mesos for resource management? Sent from my iPhone On 5 Nov, 2015, at 9:02 pm, Sabarish Sasidharan wrote: Qubole

Re: [Spark-SQL]: Disable HiveContext from instantiating in spark-shell

2015-11-06 Thread Jerry Lam
; I don't see config of skipping the above call. FYI On Fri, Nov 6, 2015 at 8:53 AM, Jerry Lam <chiling...@gmail.com> wrote: Hi spark users and developers, Is it possible to disable HiveContext from being insta

[Spark-SQL]: Disable HiveContext from instantiating in spark-shell

2015-11-06 Thread Jerry Lam
Hi spark users and developers, Is it possible to disable HiveContext from being instantiated when using spark-shell? I got the following errors when more than one session starts. Since I don't use HiveContext, it would be great if I could have more than one spark-shell running at the same time.

Re: [Spark-SQL]: Disable HiveContext from instantiating in spark-shell

2015-11-06 Thread Jerry Lam
onfig of skipping the above call. FYI On Fri, Nov 6, 2015 at 8:53 AM, Jerry Lam <chiling...@gmail.com> wrote: Hi spark users and developers, Is it possible to disable HiveContext from being instantiated when usin

Re: Indexing Support

2015-10-18 Thread Jerry Lam
I'm interested in it but I doubt there will be r-tree indexing support in the near future as spark is not a database. You might have better luck looking at databases with spatial indexing support out of the box. Cheers Sent from my iPad On 2015-10-18, at 17:16, Mustafa Elbehery

Re: Spark executor on Mesos - how to set effective user id?

2015-10-19 Thread Jerry Lam
Can you try setting SPARK_USER at the driver? It is used to impersonate users at the executor. So if you have a user set up for launching spark jobs on the executor machines, simply set it to that user name for SPARK_USER. There is another configuration that prevents jobs from being launched with

Re: Dataframes - sole data structure for parallel computations?

2015-10-08 Thread Jerry Lam
I just read the article by ogirardot but I don't agree. It is like saying the pandas dataframe is the sole data structure for analyzing data in python. Can a Pandas dataframe replace a Numpy array? The answer is simply no from an efficiency perspective for some computations. Unless there is a computer

Spark SQL: Preserving Dataframe Schema

2015-10-20 Thread Jerry Lam
Hi Spark users and developers, I have a dataframe with the following schema (Spark 1.5.1): StructType(StructField(type,StringType,true), StructField(timestamp,LongType,false)) After I save the dataframe in parquet and read it back, I get the following schema:
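A sketch of the usual workaround: re-assert the original schema after reading back, since the parquet round-trip reports every field as nullable (Spark 1.5 API; the path is illustrative):

    import org.apache.spark.sql.types._

    val schema = StructType(Seq(
      StructField("type", StringType, nullable = true),
      StructField("timestamp", LongType, nullable = false)))

    val readBack = sqlContext.read.parquet("hdfs:///path/events.parquet")
    // createDataFrame with an explicit schema restores the non-null flag
    // that the round-trip discarded.
    val restored = sqlContext.createDataFrame(readBack.rdd, schema)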

Re: spark-submit --packages using different resolver

2015-10-06 Thread Jerry Lam
This is the ticket SPARK-10951 <https://issues.apache.org/jira/browse/SPARK-10951> Cheers~ On Tue, Oct 6, 2015 at 11:33 AM, Jerry Lam <chiling...@gmail.com> wrote: Hi Burak, Thank you for the tip. Unfortunately it does not work. It throws: java.net.

Re: Java vs. Scala for Spark

2015-09-08 Thread Jerry Lam
Hi Bryan, I would choose a language based on the requirements. It does not make sense if you have a lot of dependencies that are Java-based components, and interoperability between Java and Scala is not always obvious. I agree with the above comments that Java is much more verbose than Scala in

spark-submit --packages using different resolver

2015-10-01 Thread Jerry Lam
Hi spark users and developers, I'm trying to use spark-submit --packages against private s3 repository. With sbt, I'm using fm-sbt-s3-resolver with proper aws s3 credentials. I wonder how can I add this resolver into spark-submit such that --packages can resolve dependencies from private repo?

Re: Limiting number of cores per job in multi-threaded driver.

2015-10-04 Thread Jerry Lam
Philip, the guy is trying to help you. Calling him silly is a bit too far. He might assume your problem is IO bound which might not be the case. If you need only 4 cores per job no matter what there is little advantage to use spark in my opinion because you can easily do this with just a worker

Re: Spark SQL: Native Support for LATERAL VIEW EXPLODE

2015-09-26 Thread Jerry Lam
mes: import org.apache.spark.sql.functions._ table("purchases").select(explode(df("purchase_items")).as("item")) On Fri, Sep 25, 2015 at 4:21 PM, Jerry Lam <chiling...@gmail.com> wrote: Hi sparkers, Anyone know

Re: Spark SQL: Implementing Custom Data Source

2015-09-29 Thread Jerry Lam
h...@gmail.com> wrote: See this thread: http://search-hadoop.com/m/q3RTttmiYDqGc202 And: http://spark.apache.org/docs/latest/sql-programming-guide.html#data-sources On Sep 28, 2015, at 8:22 PM, Jerry Lam <c

Spark SQL: Implementing Custom Data Source

2015-09-28 Thread Jerry Lam
Hi spark users and developers, I'm trying to learn how implement a custom data source for Spark SQL. Is there a documentation that I can use as a reference? I'm not sure exactly what needs to be extended/implemented. A general workflow will be greatly helpful! Best Regards, Jerry

Re: Spark standalone/Mesos on top of Ceph

2015-09-22 Thread Jerry Lam
Do you have specific reasons to use Ceph? I used Ceph before; I'm not too in love with it, especially when I was using the Ceph Object Gateway S3 API. There are some incompatibilities with the aws s3 api. You really really need to try it before making the commitment. Did you manage to install it? On

Re: Re: Spark standalone/Mesos on top of Ceph

2015-09-22 Thread Jerry Lam
Best, Sun. -- fightf...@163.com *From:* Jerry Lam <chiling...@gmail.com> *Date:* 2015-09-23 09:37 *To:* fightf...@163.com *CC:* user <user@spark.apache.org> *Subject:* Re: Spark standalone/Mesos on t

Spark SQL: Native Support for LATERAL VIEW EXPLODE

2015-09-25 Thread Jerry Lam
Hi sparkers, Anyone know how to do LATERAL VIEW EXPLODE without HiveContext? I don't want to start up a metastore and derby just because I need LATERAL VIEW EXPLODE. I have been trying but I always get an exception like this: Name: java.lang.RuntimeException Message: [1.68] failure: ``union''
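The DataFrame-side equivalent shown in the 2015-09-26 reply above avoids HiveQL entirely; expanded into a self-contained sketch (the path and column name are illustrative):

    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.functions._

    // explode() reproduces LATERAL VIEW EXPLODE semantics on a plain
    // SQLContext -- no metastore or derby involved.
    val sqlContext = new SQLContext(sc)
    val df = sqlContext.read.parquet("hdfs:///path/purchases.parquet")
    val items = df.select(explode(df("purchase_items")).as("item"))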

Re: How does one use s3 for checkpointing?

2015-09-21 Thread Jerry Lam
Hi Amit, Have you looked at Amazon EMR? Most people using EMR use s3 for persistence (both as input and output of spark jobs). Best Regards, Jerry Sent from my iPhone On 21 Sep, 2015, at 9:24 pm, Amit Ramesh wrote: A lot of places in the documentation mention using

Spark SQL DataFrame 1.5.0 is extremely slow for take(1) or head() or first()

2015-09-21 Thread Jerry Lam
Hi Spark Developers, I just ran some very simple operations on a dataset. I was surprised by the execution plan of take(1), head() or first(). For your reference, this is what I did in pyspark 1.5: df=sqlContext.read.parquet("someparquetfiles") df.head() The above lines take over 15 minutes. I

Re: Spark SQL DataFrame 1.5.0 is extremely slow for take(1) or head() or first()

2015-09-21 Thread Jerry Lam
0:01 AM, Yin Huai <yh...@databricks.com> wrote: Hi Jerry, Looks like it is a Python-specific issue. Can you create a JIRA? Thanks, Yin On Mon, Sep 21, 2015 at 8:56 AM, Jerry Lam <chiling...@gmail.com> wrote:

Re: Spark SQL DataFrame 1.5.0 is extremely slow for take(1) or head() or first()

2015-09-21 Thread Jerry Lam
I just noticed you found 1.4 has the same issue. I added that as well in the ticket. On Mon, Sep 21, 2015 at 1:43 PM, Jerry Lam <chiling...@gmail.com> wrote: Hi Yin, You are right! I just tried the scala version with the above lines, it works as expected. I'm n

Re: spark-submit --packages using different resolver

2015-10-06 Thread Jerry Lam
. Could you please try using the --repositories flag and provide the address: `$ spark-submit --packages my:awesome:package --repositories s3n://$aws_ak:$aws_sak@bucket/path/to/repo` If that doesn't work, could you please file a JIRA? Best, Burak

Re: spark-ec2 vs. EMR

2015-12-02 Thread Jerry Lam
er. Fixed only last week. Not sure if fixed in all branches 10. I think Amazon will include spark-jobserver in EMR soon. 11. You do not need to be an AWS expert to start an EMR cluster. Users can use the EMR web UI to start a cluster to run some

Re: ideal number of executors per machine

2015-12-15 Thread Jerry Lam
Hi Veljko, I usually ask the following questions: “how much memory per task?” then “how many CPUs per task?” then I calculate based on the memory and CPU requirements per task. You might be surprised (maybe not you, but at least I am :) ) that many OOM issues are actually because of this. Best
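A worked toy version of that calculation (all numbers are illustrative):

    Say each node gives Spark 16 cores and 64 GB, and profiling shows a task
    needs about 4 GB and 1 core. Memory allows 64 / 4 = 16 concurrent tasks
    and the cores allow 16, so the node is balanced: one 16-core/64 GB
    executor, or 4 executors at 4 cores/16 GB each (often friendlier to GC).
    If a task actually needs 8 GB, only 64 / 8 = 8 tasks fit per node, and
    scheduling 16 is precisely the kind of OOM described above.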
