Re: Window operation orderBy desc

2016-08-24 Thread Selvam Raman
Hi all, I am using a window function to find out the latest record using row_number; the Hive table is partitioned. When I run the function it runs 545 tasks. What is the reason for the 545 tasks? Thanks, Selvam R On Mon, Aug 1, 2016 at 8:09 PM, Mich Talebzadeh wrote: >

Re: spark 2.0.0 - when saving a model to S3 spark creates temporary files. Why?

2016-08-24 Thread Tal Grynbaum
I read somewhere that it's because S3 has to know the size of the file upfront. I don't really understand this: why is it OK not to know it for the temp files but not OK for the final files? The delete permission is the minor disadvantage from my side; the worst thing is that I have a

Re: quick question

2016-08-24 Thread Sivakumaran S
You create a websocket object in your spark code and write your data to the socket. You create a websocket object in your dashboard code and receive the data in realtime and update the dashboard. You can use Node.js in your dashboard (socket.io). I am sure there are other ways too. Does that
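A minimal sketch of that push approach, using a plain TCP socket in place of a real websocket library (the host, port, and stream names below are hypothetical placeholders, not the poster's actual setup):

```scala
import java.io.PrintWriter
import java.net.Socket

// Push one partition's worth of results to a listening dashboard process.
def pushToDashboard(lines: Iterator[String], host: String, port: Int): Unit = {
  val socket = new Socket(host, port)
  val out = new PrintWriter(socket.getOutputStream, true) // autoflush each line
  try lines.foreach(out.println)
  finally { out.close(); socket.close() }
}

// Inside the streaming job (resultStream is a hypothetical DStream[String]):
// resultStream.foreachRDD { rdd =>
//   rdd.foreachPartition { part => pushToDashboard(part, "dashboard-host", 9090) }
// }
```

A production dashboard would typically replace the raw socket with a websocket endpoint (e.g. socket.io on the Node.js side), as the message suggests.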

spark 2.0.0 - when saving a model to S3 spark creates temporary files. Why?

2016-08-24 Thread Aseem Bansal
Hi, when Spark saves anything to S3 it creates temporary files. Why? Asking this because it requires the access credentials to be given delete permissions along with write permissions.

Re: Spark MLlib:Collaborative Filtering

2016-08-24 Thread Devi P.V
Thanks a lot. I solved the problem using StringIndexer. On Wed, Aug 24, 2016 at 3:40 PM, Praveen Devarao wrote: > You could use the string indexer to convert your string user ids and > product ids to a numeric value. http://spark.apache.org/docs/ >

Re: Sqoop vs spark jdbc

2016-08-24 Thread ayan guha
Hi, adding one more lens to it: if we are talking about a one-off migration use case, or a weekly sync, Sqoop would be a better choice. If we are talking about regular data feeding from a DB to Hadoop, with some transformation in the pipe, Spark will do better. On Thu, Aug 25, 2016 at 2:08 PM,

Spark Logging : log4j.properties or log4j.xml

2016-08-24 Thread John Jacobs
One can specify "-Dlog4j.configuration=<path to log4j.properties>" or "-Dlog4j.configuration=<path to log4j.xml>". Is there any preference for using one over the other? All the Spark documentation talks about using "log4j.properties" only (http://spark.apache.org/docs/latest/configuration.html#configuring-logging). So is only "log4j.properties"
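For reference, a hedged sketch of how either file can be passed as a JVM system property via spark-submit; the paths are placeholders, and the --files flag is only needed so that executors also receive the file in their working directory:

```bash
spark-submit \
  --files /local/path/log4j.properties \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:/local/path/log4j.properties" \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties" \
  --class com.example.Main myapp.jar
```

The same mechanism works with a log4j.xml file; log4j itself accepts either format through the log4j.configuration property.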

Re: Sqoop vs spark jdbc

2016-08-24 Thread Ranadip Chatterjee
This will depend on multiple factors. Assuming we are talking significant volumes of data, I'd prefer Sqoop compared to Spark on YARN, if ingestion performance is the sole consideration (which is true in many production use cases). Sqoop provides some potential optimisations, especially around using

Spark Streaming user function exceptions causing hangs

2016-08-24 Thread N B
Hello, We have a Spark streaming application (running Spark 1.6.1) that consumes data from a message queue. The application is running in local[*] mode so driver and executors are in a single JVM. The issue that we are dealing with these days is that if any of our lambda functions throw any

Re: Incremental Updates and custom SQL via JDBC

2016-08-24 Thread Sascha Schmied
Thank you for your answer. I'm using an ORC transactional table right now, but I'm not stuck with that. When I send an SQL statement like the following, where old_5sek_agg and new_5sek_agg are registered temp tables, I'll get an exception in Spark. Same without the subselect. sqlContext.sql("DELETE

Re: Incremental Updates and custom SQL via JDBC

2016-08-24 Thread Mich Talebzadeh
So you want to push data from Spark Streaming to both Hive and SAP HANA tables. Let us take one at a time. Spark writing to a Hive table, but you need to delete some rows from Hive beforehand? Have you defined your ORC table as ORC transactional, or are you just marking them as deleted with two

Incremental Updates and custom SQL via JDBC

2016-08-24 Thread Oldskoola
Hi, I'm building aggregates over streaming data. When new data affects previously processed aggregates, I'll need to update the affected rows or delete them before writing the new processed aggregates back to HDFS (Hive Metastore) and a SAP HANA table. How would you do this, when writing a

Re: Sqoop vs spark jdbc

2016-08-24 Thread Mich Talebzadeh
Personally I prefer Spark JDBC. Both Sqoop and Spark rely on the same drivers. I think Spark is faster and if you have many nodes you can partition your incoming data and take advantage of Spark DAG + in memory offering. By default Sqoop will use Map-reduce which is pretty slow. Remember for
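A minimal sketch of the partitioned JDBC read being described (Spark 2.0 API; the URL, table, column, and bounds are placeholders):

```scala
import java.util.Properties
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("jdbc-ingest").getOrCreate()

val props = new Properties()
props.setProperty("user", "dbuser")
props.setProperty("password", "dbpass")

// Split the read into 8 parallel partitions on a numeric column,
// so each Spark task reads its own slice of the table.
val orders = spark.read.jdbc(
  url = "jdbc:mysql://dbhost:3306/sales",
  table = "orders",
  columnName = "order_id",      // partitioning column
  lowerBound = 1L,
  upperBound = 10000000L,
  numPartitions = 8,
  connectionProperties = props)

orders.write.mode("overwrite").parquet("hdfs:///staging/orders")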

How to compute a net (difference) given a bi-directional stream of numbers using spark streaming?

2016-08-24 Thread kant kodali
Hi guys, I am new to Spark but I am wondering how I would compute the difference given a bidirectional stream of numbers using Spark Streaming? To put it more concretely, say Bank A is sending money to Bank B and Bank B is sending money to Bank A throughout the day, such that at any given time we want
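One way to keep such a running net with Spark Streaming is stateful aggregation; a hedged sketch assuming transfers arrive as text records like "AtoB,100.0" or "BtoA,40.0" (the input source, record format, and paths are hypothetical):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("net-position")
val ssc = new StreamingContext(conf, Seconds(5))
ssc.checkpoint("hdfs:///checkpoints/net-position")   // required for stateful ops

// Transfers A->B count as positive, B->A as negative.
val transfers = ssc.socketTextStream("localhost", 9999).map { line =>
  val Array(direction, amount) = line.split(",")
  val signed = if (direction == "AtoB") amount.toDouble else -amount.toDouble
  ("A->B net", signed)
}

// Running net across all batches seen so far.
val net = transfers.updateStateByKey[Double] { (amounts: Seq[Double], state: Option[Double]) =>
  Some(state.getOrElse(0.0) + amounts.sum)
}
net.print()

ssc.start()
ssc.awaitTermination()
```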

Sqoop vs spark jdbc

2016-08-24 Thread Venkata Penikalapati
Team, please help me in choosing Sqoop or Spark JDBC to fetch data from an RDBMS. Sqoop has a lot of optimizations for fetching data; does Spark JDBC also have those? I'm performing some analytics using Spark on data that is residing in an RDBMS. Please guide me with this. Thanks, Venkata Karthik P

Re: Best way to calculate intermediate column statistics

2016-08-24 Thread Mich Talebzadeh
Hi Richard, can you use analytic functions for this purpose on the DF? HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw *

What do I lose if I run spark without using HDFS or Zookeeper?

2016-08-24 Thread kant kodali
What do I lose if I run Spark without using HDFS or ZooKeeper? Which of them is almost a must in practice?

Re: Best way to calculate intermediate column statistics

2016-08-24 Thread Richard Siebeling
Hi Mich, I'd like to gather several statistics per column in order to make analysing data easier. These two statistics are just examples; other statistics I'd like to gather are the variance, the median, several percentiles, etc. We are building a data analysis platform based on Spark, kind
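A hedged sketch of gathering such per-column statistics with DataFrame aggregations (Spark 2.0; the DataFrame and column names are placeholders, and approxQuantile gives approximate percentiles):

```scala
import org.apache.spark.sql.functions._

// df is the input DataFrame (assumed); "colA" is an example column.
val stats = df.agg(
  // "empty" here means null or empty string
  sum(when(col("colA").isNull || col("colA") === "", 1).otherwise(0)).as("empty_count"),
  countDistinct(col("colA")).as("distinct_count"),
  variance(col("colA")).as("variance"))

// Median and other percentiles, approximate with 1% relative error.
val Array(p25, median, p75) =
  df.stat.approxQuantile("colA", Array(0.25, 0.5, 0.75), 0.01)
```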

Re: Maelstrom: Kafka integration with Spark

2016-08-24 Thread Jeoffrey Lim
To clarify my earlier statement, I will continue working on Maelstrom as an alternative to official Spark integration with Kafka and keep the KafkaRDDs + Consumers as it is - until I find the official Spark Kafka more stable and resilient to Kafka broker issues/failures (reason I have infinite

Re: Spark 2.0.0 OOM error at beginning of RDD map on AWS

2016-08-24 Thread Arun Luthra
Also for the record, turning on kryo was not able to help. On Tue, Aug 23, 2016 at 12:58 PM, Arun Luthra wrote: > Splitting up the Maps to separate objects did not help. > > However, I was able to work around the problem by reimplementing it with > RDD joins. > > On Aug

Re: Maelstrom: Kafka integration with Spark

2016-08-24 Thread Jeoffrey Lim
Hi Cody, thank you for pointing out sub-millisecond processing, it is an "exaggerated" term :D I simply got excited releasing this project, it should be: "millisecond stream processing at the spark level". Highly appreciate the info about latest Kafka consumer. Would need to get up to speed about

Re: Is "spark streaming" streaming or mini-batch?

2016-08-24 Thread Mich Talebzadeh
Is "spark streaming" streaming or mini-batch? I look at something Like Complex Event Processing (CEP) which is a leader use case for data streaming (and I am experimenting with Spark for it) and in the realm of CEP there is really no such thing as continuous data streaming. The point is that when

Re: a question about LBFGS in Spark

2016-08-24 Thread DB Tsai
Hi Lingling, I think you haven't properly subscribed to the mailing list yet, so I +cc the mailing list. The mllib package is deprecated, and we no longer maintain it. The reason why it is designed this way is backward compatibility. In the original design, the updater also has the logic of

Re: spark-jdbc impala with kerberos using yarn-client

2016-08-24 Thread Marcelo Vanzin
I believe the Impala JDBC driver is mostly the same as the Hive driver, but I could be wrong. In any case, the right place to ask that question is the Impala groups (see http://impala.apache.org/). On a side note, it is a little odd that you're trying to read data from Impala using JDBC, instead

Re: Spark 2.0 with Kafka 0.10 exception

2016-08-24 Thread Srikanth
Thanks Cody. Setting poll timeout helped. Our network is fine but brokers are not fully provisioned in test cluster. But there isn't enough load to max out on broker capacity. Curious that kafkacat running on the same node doesn't have any issues. Srikanth On Tue, Aug 23, 2016 at 9:52 PM, Cody
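For reference, a hedged sketch of the poll-timeout setting referred to above, as exposed by the spark-streaming-kafka-0-10 connector (the value shown is illustrative only):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("kafka-010-job")
  // Give the Kafka consumer up to 10s per poll before the task fails.
  .set("spark.streaming.kafka.consumer.poll.ms", "10000")
```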

RE: spark.lapply in SparkR: Error in writeBin(batch, con, endian = "big")

2016-08-24 Thread Cinquegrana, Piero
Hi Spark experts, I was able to get around the broadcast issue by using a global assignment '<<-' instead of reading the data locally. However, I still get the following error: Error in writeBin(batch, con, endian = "big") : attempting to add too many elements to raw vector Pseudo code:

Re: GraphFrames 0.2.0 released

2016-08-24 Thread Maciej Bryński
Hi, Do you plan to add tag for this release on github ? https://github.com/graphframes/graphframes/releases Regards, Maciek 2016-08-17 3:18 GMT+02:00 Jacek Laskowski : > Hi Tim, > > AWESOME. Thanks a lot for releasing it. That makes me even more eager > to see it in Spark's

Re: Best way to calculate intermediate column statistics

2016-08-24 Thread Mich Talebzadeh
Hi Richard, What is the business use case for such statistics? HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw *

Re: Can we redirect Spark shuffle spill data to HDFS or Alluxio?

2016-08-24 Thread Sun Rui
Before 2.0, Spark had built-in support for caching RDD data on Tachyon (Alluxio), but that support was removed in 2.0. In either case, Spark does not support writing shuffle data to Tachyon. Since Alluxio has experimental support for FUSE

Re: Best way to calculate intermediate column statistics

2016-08-24 Thread Bedrytski Aliaksandr
Hi Richard, should these intermediate statistics be calculated from the result of the calculation or during the aggregation? If they can be derived from the resulting dataframe, why not cache (persist) that result just after the calculation? Then you may aggregate statistics from the cached

Re: Dataframe write to DB, losing primary key index & data types.

2016-08-24 Thread sujeet jog
There was an inherent bug in my code which did this. On Wed, Aug 24, 2016 at 8:07 PM, sujeet jog wrote: > Hi, > > I have a table with the definition below; when I write any records to this > table, the varchar(20) gets changed to text, and it also loses the > primary key

Re: Maelstrom: Kafka integration with Spark

2016-08-24 Thread Cody Koeninger
Yes, spark-streaming-kafka-0-10 uses the new consumer. Besides pre-fetching messages, the big reason for that is that security features are only available with the new consumer. The Kafka project is at release 0.10.0.1 now, they think most of the issues with the new consumer have been ironed
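For context, a minimal sketch of a direct stream with the 0-10 connector and the new consumer (broker addresses, topic, and group id are placeholders):

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "broker1:9092,broker2:9092",
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "example-group",
  "auto.offset.reset"  -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean))

// ssc is an existing StreamingContext.
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Subscribe[String, String](Array("events"), kafkaParams))

stream.map(record => (record.key, record.value)).print()
```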

Best way to calculate intermediate column statistics

2016-08-24 Thread Richard Siebeling
Hi, what is the best way to calculate intermediate column statistics, like the number of empty values and the number of distinct values in each column of a dataset, when aggregating or filtering data, alongside the actual result of the aggregate or the filtered data? We are developing an application in

Re: Re: Can we redirect Spark shuffle spill data to HDFS or Alluxio?

2016-08-24 Thread Saisai Shao
Based on my limited knowledge of Tachyon (Alluxio), it only provides a layer of Hadoop-compatible FileSystem API, which means it cannot be used as a shuffle data store. If it can be mounted as an OS-supported FS layer, like NFS or FUSE, then it can be used as a shuffle data store. But never neglect

Dataframe write to DB, losing primary key index & data types.

2016-08-24 Thread sujeet jog
Hi, I have a table with the definition below; when I write any records to this table, the varchar(20) gets changed to text, and it also loses the primary key index. Any idea how to write data with Spark SQL without losing the primary key index & data types? MariaDB [analytics]> show
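One hedged workaround, assuming the table already exists in the database with the desired column types and primary key: append into the pre-created table instead of letting Spark create it, so Spark does not drop and recreate it with its default type mapping. The connection details and driver class below are placeholders.

```scala
import java.util.Properties
import org.apache.spark.sql.SaveMode

val props = new Properties()
props.setProperty("user", "dbuser")
props.setProperty("password", "dbpass")
props.setProperty("driver", "org.mariadb.jdbc.Driver")   // assumed driver class

// df is the DataFrame to persist (assumed). The table definition
// (varchar(20), primary key, ...) is created up front in MariaDB, not by Spark.
df.write
  .mode(SaveMode.Append)
  .jdbc("jdbc:mariadb://dbhost:3306/analytics", "my_table", props)
```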

Re: Re: Can we redirect Spark shuffle spill data to HDFS or Alluxio?

2016-08-24 Thread tony....@tendcloud.com
Hi Saisai and Rui, thanks a lot for your answers. Alluxio tries to work as the middle layer between storage and Spark, so is it possible to use Alluxio to resolve the issue? We want to have one SSD for every datanode and use Alluxio to manage memory, SSD and HDD. Thanks and Regards, Tony

Re: Can we redirect Spark shuffle spill data to HDFS or Alluxio?

2016-08-24 Thread Sun Rui
Yes, I also tried FUSE before, it is not stable and I don’t recommend it > On Aug 24, 2016, at 22:15, Saisai Shao wrote: > > Also fuse is another candidate (https://wiki.apache.org/hadoop/MountableHDFS > ), but not so stable

Re: Can we redirect Spark shuffle spill data to HDFS or Alluxio?

2016-08-24 Thread Saisai Shao
Also FUSE is another candidate (https://wiki.apache.org/hadoop/MountableHDFS), but not so stable when I tried it before. On Wed, Aug 24, 2016 at 10:09 PM, Sun Rui wrote: > For HDFS, maybe you can try mount HDFS as NFS. But not sure about the > stability, and also there is

Re: Can we redirect Spark shuffle spill data to HDFS or Alluxio?

2016-08-24 Thread Sun Rui
For HDFS, maybe you can try mount HDFS as NFS. But not sure about the stability, and also there is additional overhead of network I/O and replica of HDFS files. > On Aug 24, 2016, at 21:02, Saisai Shao wrote: > > Spark Shuffle uses Java File related API to create local

Re: Can we redirect Spark shuffle spill data to HDFS or Alluxio?

2016-08-24 Thread Saisai Shao
Spark shuffle uses the Java File API to create local dirs and read/write data, so it can only work with an OS-supported FS. It doesn't leverage the Hadoop FileSystem API, so writing to a Hadoop-compatible FS does not work. Also it is not suitable to write temporary shuffle data into a distributed FS; this

Can we redirect Spark shuffle spill data to HDFS or Alluxio?

2016-08-24 Thread tony....@tendcloud.com
Hi all, when we run Spark on very large data, Spark will do a shuffle and the shuffle data will be written to local disk. Because we have limited capacity on local disk, the shuffled data will occupy all of the local disk and the job will then fail. So is there a way we can write the shuffle spill
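One common mitigation, sketched below, is to point Spark's local scratch space at larger (or additional) local mounts rather than a distributed FS; the paths are illustrative, and on YARN the NodeManager's yarn.nodemanager.local-dirs setting takes precedence over spark.local.dir.

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("shuffle-heavy-job")
  // Spread shuffle spill files over several local disks with more capacity.
  .set("spark.local.dir", "/mnt/disk1/spark-tmp,/mnt/disk2/spark-tmp")
```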

Re: Spark Streaming application failing with Token issue

2016-08-24 Thread Jacek Laskowski
Hi Steve, Thanks a lot for such an elaborative email (though it brought more questions than answers but it's because I'm new to YARN in general and Kerberos/tokens/tickets in particular). Thanks also for liking my notes. I'm very honoured to hear it from you. I value your work with

Re: work with russian letters

2016-08-24 Thread Jacek Laskowski
Hi Alex, Mind showing the schema? Could you AS the columns right after you load the dataset? What are the current problems you're dealing with? Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark Follow me at

Re: Spark MLlib:Collaborative Filtering

2016-08-24 Thread Praveen Devarao
You could use the string indexer to convert your string user ids and product ids to a numeric value. http://spark.apache.org/docs/latest/ml-features.html#stringindexer Thanking You - Praveen Devarao IBM India Software
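A minimal sketch of that approach (the DataFrame and column names are placeholders), indexing the string ids into numeric columns before feeding a recommender such as ALS:

```scala
import org.apache.spark.ml.feature.StringIndexer

// ratings is the input DataFrame (assumed) with string userID / productID columns.
val userIndexer = new StringIndexer()
  .setInputCol("userID")
  .setOutputCol("userIndex")
val productIndexer = new StringIndexer()
  .setInputCol("productID")
  .setOutputCol("productIndex")

val withUser = userIndexer.fit(ratings).transform(ratings)
val indexed  = productIndexer.fit(withUser).transform(withUser)
// indexed now carries numeric userIndex / productIndex columns.
```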

Re: Is "spark streaming" streaming or mini-batch?

2016-08-24 Thread Steve Loughran
On 23 Aug 2016, at 17:58, Mich Talebzadeh > wrote: In general depending what you are doing you can tighten above parameters. For example if you are using Spark Streaming for Anti-fraud detection, you may stream data in at 2 seconds
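As a concrete illustration of the knobs being discussed, a hedged sketch with a 2-second batch interval and a sliding window on top of it (the source, durations, and port are placeholders):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("fraud-window")
val ssc = new StreamingContext(conf, Seconds(2))   // 2-second micro-batches

// Hypothetical input; any DStream source works the same way.
val events = ssc.socketTextStream("localhost", 9999)

// Recompute the event count over the last 5 minutes, sliding every 10 seconds.
val counts = events.countByWindow(Seconds(300), Seconds(10))
counts.print()

ssc.start()
ssc.awaitTermination()
```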

work with russian letters

2016-08-24 Thread AlexModestov
Hello everybody, I want to work with DataFrames where some columns have a string type and contain Russian letters. The Russian letters appear incorrectly in the text. Could you help me with how I should work with them? Thanks.

Re: Spark Streaming application failing with Token issue

2016-08-24 Thread Steve Loughran
> On 23 Aug 2016, at 11:26, Jacek Laskowski wrote: > > Hi Steve, > > Could you share your opinion on whether the token gets renewed or not? > Is the token going to expire after 7 days anyway? There's Hadoop service tokens, and Kerberos tickets. They are similar-ish, but not

Best range of parameters for grid search?

2016-08-24 Thread Adamantios Corais
I would like to run a naive implementation of grid search with MLlib but I am a bit confused about choosing the 'best' range of parameters. Obviously, I do not want to waste too many resources on a combination of parameters that will probably not give a better model. Any suggestions from your
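A hedged sketch of a coarse grid over regularization parameters with CrossValidator (the estimator and the logarithmically spaced ranges below are just a common starting spread, not a recommendation for any particular problem):

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val lr = new LogisticRegression()

// Coarse, logarithmically spaced grid; refine around the best cell afterwards.
val grid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(1e-4, 1e-3, 1e-2, 1e-1, 1.0))
  .addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0))
  .build()

val cv = new CrossValidator()
  .setEstimator(lr)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(grid)
  .setNumFolds(3)

// val model = cv.fit(trainingDF)   // trainingDF is a hypothetical labeled DataFrame
```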

Re: dynamic allocation in Spark 2.0

2016-08-24 Thread Saisai Shao
This looks like the Spark application is running into an abnormal state. From the stack it means the driver could not send requests to the AM; can you please check if the AM is reachable, or whether there are any other exceptions besides this one. From my past tests, Spark's dynamic allocation may run into some corner

Re: Spark MLlib:Collaborative Filtering

2016-08-24 Thread glen
Hash it to an int. On 2016-08-24 16:28, Devi P.V wrote: Hi all, I am a newbie in collaborative filtering. I want to implement a collaborative filtering algorithm (need to find the top 10 recommended products) using Spark and Scala. I have a rating dataset where userID & ProductID are String type.

Re: DataFrame Data Manipulation - Based on a timestamp column Not Working

2016-08-24 Thread Bedrytski Aliaksandr
Hi Subhajit, you may try to use sql queries instead of helper methods: > sales_order_base_dataFrame.registerTempTable("sales_orders") > > val result = sqlContext.sql(""" > SELECT * > FROM sales_orders > WHERE unix_timestamp(SCHEDULE_SHIP_DATE,'__-MM-_dd_') >= >
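A cleaned-up sketch of the suggestion above, assuming SCHEDULE_SHIP_DATE is stored as a yyyy-MM-dd string (the cutoff date and DataFrame name are placeholders):

```scala
// salesOrderDF is the input DataFrame (assumed), Spark 1.6-style API.
salesOrderDF.registerTempTable("sales_orders")

val result = sqlContext.sql("""
  SELECT *
  FROM sales_orders
  WHERE unix_timestamp(SCHEDULE_SHIP_DATE, 'yyyy-MM-dd')
        >= unix_timestamp('2016-01-01', 'yyyy-MM-dd')
""")
```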

Spark MLlib:Collaborative Filtering

2016-08-24 Thread Devi P.V
Hi all, I am a newbie in collaborative filtering. I want to implement a collaborative filtering algorithm (need to find the top 10 recommended products) using Spark and Scala. I have a rating dataset where userID & ProductID are String type. UserID ProductID Rating b3a68043-c1

Future of GraphX

2016-08-24 Thread mas
Hi, I am wondering if there is any current work going on around optimization of GraphX? I am aware of GraphFrames, which is built on DataFrames. However, is there any plan to build a version of GraphX on the newer Spark APIs, i.e., Datasets or Spark 2.0? Furthermore, is there any plan to incorporate Graph

dynamic allocation in Spark 2.0

2016-08-24 Thread Shane Lee
Hello all, I am running Hadoop 2.6.4 with Spark 2.0 and I have been trying to get dynamic allocation to work, without success. I was able to get it to work with Spark 1.6.1, however. When I issue the command spark-shell --master yarn --deploy-mode client, this is the error I see: 16/08/24 00:05:40
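For reference, a hedged sketch of the settings dynamic allocation on YARN normally needs; the executor counts are illustrative, and spark.shuffle.service.enabled additionally requires the Spark YARN external shuffle service to be configured as an auxiliary service on every NodeManager.

```bash
spark-shell --master yarn --deploy-mode client \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=1 \
  --conf spark.dynamicAllocation.maxExecutors=20
```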