Re: Redshift Vs Spark SQL (Thrift)

2016-09-22 Thread Jörn Franke
Depends what your use case is. A generic benchmark does not make sense, because they are different technologies for different purposes. > On 23 Sep 2016, at 06:09, ayan guha wrote: > > Hi > > Is there any benchmark or point of view in terms of pros and cons between AWS > Redshift vs Spark SQL

Re: Off Heap (Tungsten) Memory Usage / Management ?

2016-09-22 Thread Jörn Franke
A JVM is always going to use a little >> >> off-heap memory by itself, so setting a max heap size of 2GB means the >> >> JVM process may use a bit more than 2GB of memory. With an off-heap >> >> intensive app like Spark it can be a lot more.

Re: Off Heap (Tungsten) Memory Usage / Management ?

2016-09-22 Thread Jörn Franke
Spark it can be a lot more. > > There's a built-in 10% overhead, so that if you ask for a 3GB executor > it will ask for 3.3GB from YARN. You can increase the overhead. > > On Wed, Sep 21, 2016 at 11:41 PM, Jörn Franke > wrote: > > All off-heap memory is still managed b
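A minimal sketch of raising the overhead setting mentioned above (the sizes are illustrative only; on YARN the overhead value is given in megabytes):

import org.apache.spark.SparkConf

// Illustrative numbers: a 3 GB executor heap plus an explicit 1 GB overhead, so the
// container requested from YARN is roughly 4 GB instead of the default 3 GB + 10 %.
val conf = new SparkConf()
  .set("spark.executor.memory", "3g")
  .set("spark.yarn.executor.memoryOverhead", "1024") // MB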

Re: Memory usage by Spark jobs

2016-09-22 Thread Jörn Franke
You should also take into account that Spark has different options to represent data in-memory, such as Java serialized objects, Kryo serialized, Tungsten (columnar, optionally compressed) etc. The Tungsten representation depends heavily on the underlying data and sorting, especially if compressed. Then, yo
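A minimal sketch of switching the in-memory representation to Kryo; MyRecord is just a placeholder for your own classes:

import org.apache.spark.SparkConf

// Stand-in for whatever classes your RDDs actually hold.
case class MyRecord(id: Long, payload: String)

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[MyRecord]))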

Re: Off Heap (Tungsten) Memory Usage / Management ?

2016-09-21 Thread Jörn Franke
All off-heap memory is still managed by the JVM process. If you limit the memory of this process then you limit the memory. I think the memory of the JVM process could be limited via the Xms/Xmx parameters of the JVM. This can be configured via Spark options for YARN (be aware that they are diffe

Re: Sqoop vs spark jdbc

2016-09-21 Thread Jörn Franke
I think there might be still something messed up with the classpath. It complains in the logs about deprecated jars and deprecated configuration files. > On 21 Sep 2016, at 22:21, Mich Talebzadeh wrote: > > Well I am left to use Spark for importing data from RDBMS table to Hadoop. > > You may

Re: SPARK PERFORMANCE TUNING

2016-09-21 Thread Jörn Franke
Do you mind sharing what your software does? What is the input data size? What is the Spark version and which APIs are used? How many nodes? What is the input data format? Is compression used? > On 21 Sep 2016, at 13:37, Trinadh Kaja wrote: > > Hi all, > > how to increase spark performance, I am using

Re: filling missing values in a sequence

2016-09-18 Thread Jörn Franke
I am not sure what you are trying to achieve here. Can you please tell us what the goal of the program is, maybe with some example data? Besides this, I have the feeling that it will fail once it is not used in a single-node scenario, due to the reference to the global counter variable. Also unclear wh

Re: Is there such thing as cache fusion with the underlying tables/files on HDFS

2016-09-18 Thread Jörn Franke
Ignite has a special cache for HDFS data (which is not a Java cache), for RDDs etc. So you are right, it is in this sense very different. Besides caching, what I see from data scientists is that for interactive queries and model evaluation they do not browse the complete data anyway. Even

Re: Is there such thing as cache fusion with the underlying tables/files on HDFS

2016-09-17 Thread Jörn Franke
In Tableau you can use the in-memory facilities of the Tableau server. As said, Apache Ignite could be one way. You can also use it to make Hive tables in-memory. While reducing IO can make sense, I do not think you will see that much difference in production systems (at least not 20x). If the

Re: Spark Streaming-- for each new file in HDFS

2016-09-15 Thread Jörn Franke
Hi, I recommend that the third party application puts an empty file with the same filename as the original file, but with the extension ".uploaded". This is an indicator that the file has been fully (!) written to the filesystem. Otherwise you risk reading only parts of the file. Then, you can have a file sy
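A minimal sketch of this marker-file pattern with Spark Streaming's fileStream; the ".uploaded" extension comes from the mail, the directory, batch interval and class names are illustrative:

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MarkerFileStream {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("MarkerFileStream"), Seconds(30))

    // Accept a data file only if its ".uploaded" marker already exists,
    // i.e. the producer has finished writing it; skip the markers themselves.
    def fullyUploaded(path: Path): Boolean = {
      val fs = FileSystem.get(ssc.sparkContext.hadoopConfiguration)
      !path.getName.endsWith(".uploaded") &&
        fs.exists(new Path(path.getParent, path.getName + ".uploaded"))
    }

    val lines = ssc
      .fileStream[LongWritable, Text, TextInputFormat]("hdfs:///landing/dir", fullyUploaded _, newFilesOnly = true)
      .map(_._2.toString)

    lines.print()
    ssc.start()
    ssc.awaitTermination()
  }
}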

Re: Streaming - lookup against reference data

2016-09-14 Thread Jörn Franke
Hmm, is it just a lookup and the values are small? I do not think that in this case Redis needs to be installed on each worker node. Redis has a rather efficient protocol, hence one or a few dedicated Redis nodes are probably more than enough for your purpose. Just try to reuse connections and do not
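A minimal sketch of the connection-reuse advice, assuming the Jedis client and a dedicated Redis node at "redis-host:6379" (both hypothetical here):

import org.apache.spark.streaming.dstream.DStream
import redis.clients.jedis.Jedis

// One connection is opened per partition and reused for every record in it,
// instead of opening a connection per record.
def enrichWithRedis(keys: DStream[String]): DStream[(String, Option[String])] =
  keys.mapPartitions { part =>
    val jedis = new Jedis("redis-host", 6379)
    val looked = part.map(k => (k, Option(jedis.get(k)))).toList // materialize before closing
    jedis.close()
    looked.iterator
  }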

Re: Reading the most recent text files created by Spark streaming

2016-09-14 Thread Jörn Franke
Hi, an alternative to Spark could be Flume to store data from Kafka to HDFS. It also provides some reliability mechanisms, has been explicitly designed for import/export, and is well tested. Not sure if I would go for Spark Streaming if the use case is only storing, but I do not have the full pict

Re: Efficiently write a Dataframe to Text file(Spark Version 1.6.1)

2016-09-14 Thread Jörn Franke
It could be that by using the RDD it converts the data from the internal format to Java objects (-> much more memory is needed), which may lead to spill over to disk. This conversion takes a lot of time. Then, you need to transfer these Java objects via the network to one single node (repartition ..

Re: Efficiently write a Dataframe to Text file(Spark Version 1.6.1)

2016-09-14 Thread Jörn Franke
Hi, DataFrames are more efficient if you have Tungsten activated as the underlying processing engine (normally by default). However, this only speeds up processing; saving, as an IO-bound operation, not necessarily. What exactly is slow? The write? You could use myDF.write.save()... Howe
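A minimal sketch for Spark 1.6.x of writing text while staying in the DataFrame API instead of converting to an RDD first; the delimiter and output path are examples only:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.concat_ws

// write.text expects a single string column, so all columns are concatenated first.
def writeAsText(df: DataFrame, path: String): Unit = {
  val asText = df.select(concat_ws("|", df.columns.map(df(_)): _*).as("value"))
  asText.write.text(path)
}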

Re: Best way to read XML data from RDD

2016-08-19 Thread Jörn Franke
I fear the issue is that this will create and destroy an XML parser object 2 million times, which is very inefficient - it does not really look like a parser performance issue. Can't you do something about the format choice? Ask your supplier to deliver another format (ideally Avro or something like this?)

Re: How to Improve Random Forest classifier accuracy

2016-08-18 Thread Jörn Franke
Depends on your data... How did you split training and test set? How well does the model fit the data? You could of course also try to feed more data into the model. Have you considered alternative machine learning models? I do not think this is a Spark problem, but you should ask the mac
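A minimal sketch of a held-out split, so accuracy is measured on unseen data; the 80/20 ratio and the seed are illustrative:

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Keep a test set aside instead of evaluating on the training data.
def split(data: RDD[LabeledPoint]): (RDD[LabeledPoint], RDD[LabeledPoint]) = {
  val Array(training, test) = data.randomSplit(Array(0.8, 0.2), seed = 42L)
  (training, test)
}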

Re: Spark Yarn executor container memory

2016-08-15 Thread Jörn Franke
Both are part of the heap. > On 16 Aug 2016, at 04:26, Lan Jiang wrote: > > Hello, > > My understanding is that YARN executor container memory is based on > "spark.executor.memory" + “spark.yarn.executor.memoryOverhead”. The first one > is for heap memory and second one is for offheap memory.

Re: how to do nested loops over 2 arrays but use Two RDDs instead ?

2016-08-15 Thread Jörn Franke
Depends on the size of the arrays, but is what you want to achieve similar to a join? > On 15 Aug 2016, at 20:12, Eric Ho wrote: > > Hi, > > I've two nested-for loops like this: > > for all elements in Array A do: > > for all elements in Array B do: > > compare a[3] with b[4] see if they
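A minimal sketch of the nested-loop comparison expressed over two RDDs; the field indices 3 and 4 come from the question, everything else is illustrative. For large inputs a keyed join is usually preferable to a full cartesian product.

import org.apache.spark.rdd.RDD

// Pair every element of A with every element of B, then keep the matching pairs.
def comparePairs(rddA: RDD[Array[String]], rddB: RDD[Array[String]]): RDD[(Array[String], Array[String])] =
  rddA.cartesian(rddB).filter { case (a, b) => a(3) == b(4) }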

Re: Does Spark SQL support indexes?

2016-08-13 Thread Jörn Franke
Use a format that has built-in indexes, such as Parquet or ORC. Do not forget to sort the data on the columns that you filter on. > On 14 Aug 2016, at 05:03, Taotao.Li wrote: > > > hi, guys, does Spark SQL support indexes? if so, how can I create an index > on my temp table? if not, how can
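A minimal sketch of persisting data sorted on the filter column in a format with built-in min/max indexes (ORC here, Parquet works the same way); column name and path are examples only:

import org.apache.spark.sql.DataFrame

// Sorting within partitions lets predicate push-down skip whole stripes/row groups.
def writeSortedOrc(df: DataFrame, path: String): Unit =
  df.sortWithinPartitions("customer_id")
    .write
    .format("orc")
    .save(path)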

Re: Extracting key word from a textual column

2016-08-02 Thread Jörn Franke

Re: Extracting key word from a textual column

2016-08-02 Thread Jörn Franke

Re: Extracting key word from a textual column

2016-08-02 Thread Jörn Franke
If you need single inserts, updates, deletes and selects, why not use HBase with Phoenix? I see it as complementary to the Hive / warehouse offering. > On 02 Aug 2016, at 22:34, Mich Talebzadeh wrote: > > Hi, > > I decided to create a catalog table in Hive ORC and transactional. That table

Re: Custom Image RDD and Sequence Files

2016-07-28 Thread Jörn Franke
Why don't you write your own Hadoop FileInputFormat? It can be used by Spark... > On 28 Jul 2016, at 20:04, jtgenesis wrote: > > Hey all, > > I was wondering what the best course of action is for processing an image > that has an involved internal structure (file headers, sub-headers, image > d

Re: ORC v/s Parquet for Spark 2.0

2016-07-28 Thread Jörn Franke
is by Hortonworks, so battle of file format continues... >>> On Jul 27, 2016, at 4:54 PM, janardhan shetty >>> wrote: >>> >>> Seems like parquet format is better comparatively to orc when the dataset >>> is

Re: ORC v/s Parquet for Spark 2.0

2016-07-27 Thread Jörn Franke
>>>>> [1] and part of orc has been enhanced while deployed for Facebook, Yahoo >>>>> [2]. >>>>> >>>>> Other than this presentation [3], do you guys know any other benchmark? >>>>> >>>>> [1] https://parquet

Re: ORC v/s Parquet for Spark 2.0

2016-07-26 Thread Jörn Franke
I think both are very similar, but with slightly different goals. While they work transparently for each Hadoop application, you need to enable specific support in the application for predicate push-down. In the end you have to check which application you are using and do some tests (with correc

Re: Little idea needed

2016-07-19 Thread Jörn Franke
Well, as far as I know there is some update statement planned for Spark, but I am not sure for which release. You could alternatively use Hive+ORC. Another alternative would be to add the deltas in a separate file and, when accessing the table, filter out the duplicate entries. From time to time you could

Re: Custom InputFormat (SequenceFileInputFormat vs FileInputFormat)

2016-07-15 Thread Jörn Franke
I am not sure if I exactly understand your use case, but for my Hadoop/Spark format that reads the Bitcoin blockchain I extend from FileInputFormat. I use the default split mechanism. This could mean that I split in the middle of a bitcoin block, which is no issue, because the first split can r

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-12 Thread Jörn Franke
I think the comparison with the Oracle RDBMS and Oracle TimesTen is not so good. There are times when the in-memory database of Oracle is slower than the RDBMS (especially in the case of Exadata), due to the issue that in-memory - as in Spark - means everything is in memory and everything is always pro

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-11 Thread Jörn Franke
er way getting the same result. However, my concerns: >> >> Spark has a wide user base. I judge this from Spark user group traffic >> TEZ user group has no traffic I am afraid >> LLAP I don't know >> Sounds like Hortonworks promote TEZ and Cloudera does not wa

Re: Processing json document

2016-07-08 Thread Jörn Franke
> "lastName":"Doe" > }, > { > "firstName":"Anna", >"lastName":"Smith" > }, > { >"firstName":"Peter", > "lastName":"Jones"

Re: Memory grows exponentially

2016-07-08 Thread Jörn Franke
Memory fragmentation? Quite common with in-memory systems. > On 08 Jul 2016, at 08:56, aasish.kumar wrote: > > Hello everyone: > > I have been facing a problem associated with spark streaming memory. > > I have been running two Spark Streaming jobs concurrently. The jobs read > data from Kafka with

Re: Processing json document

2016-07-06 Thread Jörn Franke
This does not necessarily need to be the case: if you look at the Hadoop FileInputFormat architecture, you can even split large multi-line JSONs without issues. I would need to have a look at it, but one large file does not necessarily mean one executor, independent of the underlying format. > On 07 Jul 2016,

Re: Using R code as part of a Spark Application

2016-06-29 Thread Jörn Franke
Still, you need SparkR > On 29 Jun 2016, at 19:14, John Aherne wrote: > > Microsoft Azure has an option to create a spark cluster with R Server. MS > bought RevoScale (I think that was the name) and just recently deployed it. > >> On Wed, Jun 29, 2016 at 10:53 AM, Xinh Huynh wrote: >> There is

Re: Joining a compressed ORC table with a non compressed text table

2016-06-29 Thread Jörn Franke

Re: Joining a compressed ORC table with a non compressed text table

2016-06-29 Thread Jörn Franke

Re: Joining a compressed ORC table with a non compressed text table

2016-06-28 Thread Jörn Franke
Bzip2 is splittable for text files. Btw, in ORC the question of splittability does not matter, because each stripe is compressed individually. Have you tried Tez? As far as I recall (at least it was the case in the first version of Hive), MR uses a single reducer for ORDER BY, which is a bottleneck. Do you

Re: Difference between Dataframe and RDD Persisting

2016-06-26 Thread Jörn Franke
A DataFrame uses a more efficient binary representation to store and persist data. You should go for that one in most cases. An RDD is slower. > On 27 Jun 2016, at 07:54, Brandon White wrote: > > What is the difference between persisting a dataframe and a rdd? When I > persist my RDD, the UI
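A minimal sketch of the difference, caching both representations side by side for comparison (the storage levels are illustrative):

import org.apache.spark.sql.DataFrame
import org.apache.spark.storage.StorageLevel

// Caching the DataFrame keeps the compact binary columnar representation;
// caching its RDD stores (serialized) Java objects, which is usually larger.
def cacheBoth(df: DataFrame): Unit = {
  df.persist(StorageLevel.MEMORY_ONLY)
  df.rdd.persist(StorageLevel.MEMORY_ONLY_SER)
}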

Re: Cost of converting RDD's to dataframe and back

2016-06-24 Thread Jörn Franke
e at https://twitter.com/jaceklaskowski > > >> On Fri, Jun 24, 2016 at 10:14 AM, Jörn Franke wrote: >> I would push the Spark people to provide equivalent functionality. In the >> end it is a deserialization/serialization process which should not be done

Re: Cost of converting RDD's to dataframe and back

2016-06-24 Thread Jörn Franke
I would push the Spark people to provide equivalent functionality. In the end it is a deserialization/serialization process which should not be done back and forth, because it is one of the more costly aspects during processing. It needs to convert Java objects to a binary representation. It is

Re: how to avoid duplicate messages with spark streaming using checkpoint after restart in case of failure

2016-06-22 Thread Jörn Franke
cessing as we do get 300k > messages per sec , so lookup will slow down. > > Thanks > Sandesh > >> On Wed, Jun 22, 2016 at 3:28 PM, Jörn Franke wrote: >> >> Spark Streamig does not guarantee exactly once for output action. It means >> that one item is only

Re: how to avoid duplicate messages with spark streaming using checkpoint after restart in case of failure

2016-06-22 Thread Jörn Franke
Spark Streaming does not guarantee exactly-once for output actions. It means that one item is only processed once in an RDD. You can achieve at-most-once or at-least-once. You could however do at-least-once (via checkpointing) and record which messages have been processed (some identifier available?) and do
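A minimal sketch of the at-least-once setup via checkpointing; the recovered context replays unfinished batches, so the output action still has to tolerate duplicates (e.g. by upserting on a message identifier). Host, port and paths are illustrative.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

def createContext(): StreamingContext = {
  val ssc = new StreamingContext(new SparkConf().setAppName("AtLeastOnceExample"), Seconds(10))
  ssc.checkpoint("hdfs:///checkpoints/at-least-once")
  val lines = ssc.socketTextStream("ingest-host", 9999)
  lines.foreachRDD(rdd => rdd.foreach(println)) // replace with an idempotent sink
  ssc
}

// On restart the context is rebuilt from the checkpoint instead of calling createContext.
val ssc = StreamingContext.getOrCreate("hdfs:///checkpoints/at-least-once", createContext _)
ssc.start()
ssc.awaitTermination()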

Re: Silly question about Yarn client vs Yarn cluster modes...

2016-06-21 Thread Jörn Franke
I would import the data via Sqoop and put it on HDFS. It has some mechanisms to handle the lack of reliability of JDBC. Then you can process the data via Spark. You could also use the JDBC RDD, but I do not recommend using it, because you do not want to pull data out of the database all the time when

Re: Spark - “min key = null, max key = null” while reading ORC file

2016-06-20 Thread Jörn Franke
If you insert the data sorted then there is no need to bucket the data. You can even create an index in Spark: simply set the output format configuration orc.create.index = true. > On 20 Jun 2016, at 09:10, Mich Talebzadeh wrote: > > Right, your concern is that you expect the storeindex in ORC fil

Re: HIVE Query 25x faster than SPARK Query

2016-06-16 Thread Jörn Franke
I agree here. However it depends always on your use case ! Best regards > On 16 Jun 2016, at 04:58, Gourav Sengupta wrote: > > Hi Mahender, > > please ensure that for dimension tables you are enabling the broadcast > method. You must be able to see surprising gains @12x. > > Overall I th

Re: Is that normal spark performance?

2016-06-15 Thread Jörn Franke
What volume do you have? Why don't you use the corresponding Cassandra functionality directly? If you do it once and not iteratively in-memory, you cannot expect so much improvement. > On 15 Jun 2016, at 16:01, nikita.dobryukha wrote: > > We use Cassandra 3.5 + Spark 1.6.1 in a 2-node cluster

Re: Suggestions on Lambda Architecture in Spark

2016-06-14 Thread Jörn Franke
You do not describe use cases, but technologies. First be aware of your needs and then check technologies. Otherwise nobody can help you properly and you will end up with an inefficient stack for your needs. > On 14 Jun 2016, at 00:52, KhajaAsmath Mohammed > wrote: > > Hi, > > In my current

Re: Analyzing twitter data

2016-06-08 Thread Jörn Franke
Anyone who loves [Love Live], please follow me > Please check my pinned tweet > I'm glad I came across Love Live! > I'll never forget all nine of them > #LoveLiveforever > #WantToConnectWithLoveLivers RT https://t.co/kITPDLER9x > https://pbs.twimg.com/media/CkA-exTWYAAK8TU.jpg > 1000RT: [Underfunded] "Gakuen Handsome" is seeking crowdfunding support for an anime adaptation > h

Re: Analyzing twitter data

2016-06-08 Thread Jörn Franke
enterprise search server with a REST-like API. You >>>> put documents in it (called "indexing") via JSON, XML, CSV or binary over >>>> HTTP. You query it via HTTP GET and receive JSON, XML, CSV or binary >>>> results. >>>> >>>> thanks >>>>

Re: Analyzing twitter data

2016-06-08 Thread Jörn Franke

Re: Analyzing twitter data

2016-06-07 Thread Jörn Franke
CSV or binary over HTTP. > You query it via HTTP GET and receive JSON, XML, CSV or binary results. > > thanks > > Dr Mich Talebzadeh > > LinkedIn > https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw > > http://talebzadehmich.wordpr

Re: Advice on Scaling RandomForest

2016-06-07 Thread Jörn Franke
Before hardware optimization there is always software optimization. Are you using Dataset / DataFrame? Are you using the right data types (e.g. int where int is appropriate; try to avoid string and char etc.)? Do you extract only the stuff needed? What are the algorithm parameters? > On 07 Jun 201

Re: Analyzing twitter data

2016-06-07 Thread Jörn Franke
presume this is a typical question. > > You mentioned Spark ml (machine learning?) . Is that something viable? > > Cheers > > > > > > Dr Mich Talebzadeh > > LinkedIn > https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw &g

Re: Analyzing twitter data

2016-06-07 Thread Jörn Franke
Spark ML Support Vector Machines or neural networks could be candidates. For unstructured learning it could be clustering. For doing a graph analysis on the followers you can easily use Spark GraphX. Keep in mind that each tweet contains a lot of metadata (location, followers etc.) that is more or

Re: twitter data analysis

2016-06-03 Thread Jörn Franke
Or combine both! With Spark Streaming it is possible to combine streaming data and data on HDFS. In the end it always depends on what you want to do and when you need what. > On 03 Jun 2016, at 10:26, Mich Talebzadeh wrote: > > I use twitter data with spark streaming to experiment with twitter data.

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-30 Thread Jörn Franke
an email to Hive user group to see anyone has managed to >>> built a vendor independent version. >>> >>> >>> Dr Mich Talebzadeh >>> >>> LinkedIn >>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw >>>

Re: Query related to spark cluster

2016-05-29 Thread Jörn Franke
Well, if you require R then you need to install it (including all additional packages) on each node. I am not sure why you store the data in Postgres. Storing it in Parquet or ORC on HDFS is sufficient (sorted on the relevant columns), and you can use the SparkR libraries to access it. > On 30 May 2

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-29 Thread Jörn Franke
h TEZ) or use Impala instead of Hive > etc as I am sure you already know. > > Cheers, > > > > > Dr Mich Talebzadeh > > LinkedIn > https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw > > http://talebzadehmich.wordpress.com &

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-29 Thread Jörn Franke

Re: Pros and Cons

2016-05-25 Thread Jörn Franke
final sentence about this. Both systems develop and change. > On 25 May 2016, at 22:14, Reynold Xin wrote: > >> On Wed, May 25, 2016 at 9:52 AM, Jörn Franke wrote: >> Spark is more for machine learning, working iteratively over the same whole >> dataset in memory. Ad

Re: Pros and Cons

2016-05-25 Thread Jörn Franke
Hive has a little bit more emphasis on the case where the data that is queried is much bigger than the available memory, or where you need to query many different small data subsets, or recently interactive queries (LLAP etc.). Spark is more for machine learning, working iteratively over the whole sa

Re: Cartesian join on RDDs taking too much time

2016-05-25 Thread Jörn Franke
Fuzzy match logic. > > How can use map/reduce operations across 2 rdds ? > > Thanks, > Padma Ch > >> On Wed, May 25, 2016 at 4:49 PM, Jörn Franke wrote: >> >> Alternatively depending on the exact use case you may employ solr on Hadoop >> for text

Re: Cartesian join on RDDs taking too much time

2016-05-25 Thread Jörn Franke
Alternatively, depending on the exact use case, you may employ Solr on Hadoop for text analytics. > On 25 May 2016, at 12:57, Priya Ch wrote: > > Lets say i have rdd A of strings as {"hi","bye","ch"} and another RDD B of > strings as {"padma","hihi","chch","priya"}. For every string rdd A i nee

Re: Cartesian join on RDDs taking too much time

2016-05-25 Thread Jörn Franke
No, this is not needed; look at the map/reduce operations and the standard Spark word count. > On 25 May 2016, at 12:57, Priya Ch wrote: > > Lets say i have rdd A of strings as {"hi","bye","ch"} and another RDD B of > strings as {"padma","hihi","chch","priya"}. For every string rdd A i need >
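For reference, a minimal sketch of the standard Spark word count mentioned above (the input path is an example):

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Classic map/reduce-style pipeline on an RDD of text lines.
def wordCount(sc: SparkContext): RDD[(String, Int)] =
  sc.textFile("hdfs:///input/text")
    .flatMap(_.split("\\s+"))
    .map(word => (word, 1))
    .reduceByKey(_ + _)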

Re: Cartesian join on RDDs taking too much time

2016-05-25 Thread Jörn Franke
What is the use case of this? A Cartesian product is by definition slow in any system. Why do you need this? How long does your application take now? > On 25 May 2016, at 12:42, Priya Ch wrote: > > I tried > dataframe.write.format("com.databricks.spark.csv").save("/hdfs_path"). Even > this i

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-23 Thread Jörn Franke
Hi Mich, I think these comparisons are useful. One interesting aspect could be hardware scalability in this context, and additionally different types of computations. Furthermore, one could compare Spark and Tez+LLAP as execution engines. I have the gut feeling that each one can be justified by di

Re: Spark for offline log processing/querying

2016-05-22 Thread Jörn Franke
Do you want to replace ELK by Spark? Depending on your queries you could do as you proposed. However, many of the text analytics queries will probably be much faster on ELK. If your queries are more interactive and not about batch processing then it does not make so much sense. I am not sure why

Re: How to insert data for 100 partitions at a time using Spark SQL

2016-05-22 Thread Jörn Franke
14000 partitions seem to be way too many to be performant (except for large data sets). How much data does one partition contain? > On 22 May 2016, at 09:34, SRK wrote: > > Hi, > > In my Spark SQL query to insert data, I have around 14,000 partitions of > data which seems to be causing memory

Re: set spark 1.6 with Hive 0.14 ?

2016-05-21 Thread Jörn Franke
What is the motivation for using such an old version of Hive? This will lead to lower performance and other risks. > On 21 May 2016, at 01:57, "kali.tumm...@gmail.com" > wrote: > > Hi All , > > Is there a way to ask spark and spark-sql to use Hive 0.14 version instead > of inbuilt hive 1.2.1. >

Re: Load Table as DataFrame

2016-05-17 Thread Jörn Franke
Do you have the full source code? Why do you convert a DataFrame to an RDD? This does not make sense to me. > On 18 May 2016, at 06:13, Mohanraj Ragupathiraj wrote: > > I have created a DataFrame from a HBase Table (PHOENIX) which has 500 million > rows. From the DataFrame I created an RDD of J

Re: How big the spark stream window could be ?

2016-05-09 Thread Jörn Franke
I do not recommend large windows. You can have small windows, store the data and then do the reports for one hour or one day on stored data. > On 09 May 2016, at 05:19, "kramer2...@126.com" wrote: > > We have some stream data need to be calculated and considering use spark > stream to do it. >

Re: migration from Teradata to Spark SQL

2016-05-04 Thread Jörn Franke
Look at lambda architecture. What is the motivation of your migration? > On 04 May 2016, at 03:29, Tapan Upadhyay wrote: > > Hi, > > We are planning to move our adhoc queries from teradata to spark. We have > huge volume of queries during the day. What is best way to go about it - > > 1) Re

Re: Consume WebService in Spark

2016-05-02 Thread Jörn Franke
In Spark it is not different compared to any other program. However, a web service and JSON are probably not very suitable for large data volumes. > On 03 May 2016, at 04:45, KhajaAsmath Mohammed > wrote: > > Hi, > > I am working on a project to pull data from sprinklr for every 15 minutes and >

Re: Performance benchmarking of Spark Vs other languages

2016-05-02 Thread Jörn Franke
Hello, Spark is a general framework for distributed in-memory processing. You can always write a highly specialized piece of code which is faster than Spark, but then it can do only one thing, and if you need something else you will have to rewrite everything from scratch. This is why Spark is be

Re: Reading from Amazon S3

2016-05-02 Thread Jörn Franke
You are oversimplifying here and some of your statements are not correct. There are also other aspects to consider. Finally, it would be better to support him with the problem, because Spark supports Java; Java and Scala run on the same underlying JVM. > On 02 May 2016, at 17:42, Gourav Sengupt

Re: slow SQL query with cached dataset

2016-04-25 Thread Jörn Franke
I do not know your data, but it looks like you have too many partitions for such a small data set. > On 26 Apr 2016, at 00:47, Imran Akbar wrote: > > Hi, > > I'm running a simple query like this through Spark SQL: > > sqlContext.sql("SELECT MIN(age) FROM data WHERE country = 'GBR' AND > dt_y
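A minimal sketch of reducing the partition count of a small, cached dataset so that a simple aggregation does not schedule thousands of near-empty tasks (the target of 8 partitions is illustrative):

import org.apache.spark.sql.DataFrame

// Collapse the data into a handful of partitions before caching and querying it.
def compact(df: DataFrame): DataFrame = df.coalesce(8)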

Re: Call Spark package API from R

2016-04-25 Thread Jörn Franke
You can call any Java/Scala library from R using the package rJava. > On 25 Apr 2016, at 19:16, ankur.jain wrote: > > Hello Team, > > Is there any way to call spark code (scala/python) from R? > I want to use Cloudera spark-ts api with SparkR, if anyone had used that > please let me know. > >

Re: Spark 1.6.1 DataFrame write to JDBC

2016-04-20 Thread Jörn Franke
Well, it could also depend on the receiving database. You should also check the executors. Updating to the latest version of the JDBC driver, and to JDK 8 if supported by the JDBC driver, could help. > On 20 Apr 2016, at 00:14, Jonathan Gray wrote: > > Hi, > > I'm trying to write ~60 million rows from

Re: Scala vs Python for Spark ecosystem

2016-04-20 Thread Jörn Franke
Python can access the JVM - this is how it interfaces with Spark. Some of the components do not have a wrapper for the corresponding Java API yet and thus are not accessible in Python. Same for Elasticsearch; you need to write a more or less simple wrapper. > On 20 Apr 2016, at 09:53, "kramer2...

Re: Processing millions of messages in milliseconds -- Architecture guide required

2016-04-19 Thread Jörn Franke
I do not think there is a simple how-to for this. First you need to be clear about the volumes in storage, in transit, and in processing. Then you need to be aware of what kind of queries you want to do. Your assumption of milliseconds for the expected data volumes currently seems to be unrealistic. Howev

Re: How to start HDFS on Spark Standalone

2016-04-18 Thread Jörn Franke
I think the easiest would be to use a Hadoop Windows distribution, such as Hortonworks. However, the Linux version of Hortonworks is a little bit more advanced. > On 18 Apr 2016, at 14:13, My List wrote: > > Deepak, > > The following could be a very dumb questions so pardon me for the same. >

Re: Apache Flink

2016-04-18 Thread Jörn Franke
What is your exact set of requirements for algo trading? Is it reacting in real time or analysis over a longer time? In the first case, I do not think a framework such as Spark or Flink makes sense. They are generic, but in order to compete with the usually custom-developed, highly specialized en

Re: Moving Hive metastore to Solid State Disks

2016-04-17 Thread Jörn Franke
You could also explore the in-memory database of 12c. However, I am not sure how beneficial it is for OLTP scenarios. I am excited to see how the performance will be with HBase as a Hive metastore. Nevertheless, your results on Oracle/SSD will be beneficial for the community. > On 17 Apr 2016,

Re: orc vs parquet aggregation, orc is really slow

2016-04-16 Thread Jörn Franke
Generally a recommendation (besides the issue): do not store dates as String, I recommend making them ints. It will be much faster in both cases. It could be that you load them differently into the tables. Generally, for these tables you should insert the data sorted into the tables in both cases
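A minimal sketch of the dates-as-int recommendation, deriving an int yyyyMMdd column from a date/timestamp column before writing (column names are examples only):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.date_format

// An int like 20160416 compares and filters much faster than a date string.
def withIntDate(df: DataFrame): DataFrame =
  df.withColumn("trade_date_int", date_format(df("trade_date"), "yyyyMMdd").cast("int"))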

Re: Can this performance be improved?

2016-04-14 Thread Jörn Franke
You could use a different format, and the Dataset or DataFrame API instead of the RDD. > On 14 Apr 2016, at 23:21, Bibudh Lahiri wrote: > > Hi, > As part of a larger program, I am extracting the distinct values of some > columns of an RDD with 100 million records and 4 columns. I am running Spark >
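A minimal sketch of that suggestion: keep the data in a columnar format (Parquet here) and compute the distinct values through the DataFrame API instead of a 4-column RDD of strings. Path and column names are examples only.

import org.apache.spark.sql.{DataFrame, SQLContext}

// Column pruning plus the compact binary representation keep memory use far below
// what an equivalent RDD of Java objects would need.
def distinctValues(sqlContext: SQLContext): DataFrame =
  sqlContext.read.parquet("hdfs:///data/events")
    .select("col1", "col2")
    .distinct()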

Re: Spark replacing Hadoop

2016-04-14 Thread Jörn Franke
I do not think so. Hadoop provides an ecosystem in which you can deploy different engines, such as MR, HBase, TEZ, Spark, Flink, titandb, hive, solr... I observe also that commercial analytical tools use one or more of these engines to execute their code in a distributed fashion. You need this

Re: Sqoop on Spark

2016-04-14 Thread Jörn Franke
a. > > Why is the discussion about using anything other than SQOOP still so > wonderfully on? > > > Regards, > Gourav > >> On Mon, Apr 11, 2016 at 6:26 PM, Jörn Franke wrote: >> Actually I was referring to having an external table in Oracle, which is >> use

Re: Unable to Access files in Hadoop HA enabled from using Spark

2016-04-12 Thread Jörn Franke
Is the host in /etc/hosts? > On 13 Apr 2016, at 07:28, Amit Singh Hora wrote: > > I am trying to access a directory in Hadoop from my Spark code on a local > machine. Hadoop is HA enabled. > > val conf = new SparkConf().setAppName("LDA Sample").setMaster("local[2]") > val sc=new SparkContext(conf)

Re: Sqoop on Spark

2016-04-11 Thread Jörn Franke
y from it… you can do a very simple bulk load/unload process. However > you need to know the file’s format. > > Not sure what IBM or Oracle has done to tie their RDBMs to Big Data. > > As I and other posters to this thread have alluded to… this would be a block > bulk load/u

Re: Sqoop on Spark

2016-04-10 Thread Jörn Franke
ell. It is using JDBC for each connection between data-nodes and their >>>>> AMP (compute) nodes. There is an additional layer that coordinates all of >>>>> it. >>>>> I know Oracle has a similar technology I've used it and had to supply the >&

Re: Sqoop on Spark

2016-04-06 Thread Jörn Franke
. ;-) > > Just saying. ;-) > > -Mike > >> On Apr 5, 2016, at 10:44 PM, Jörn Franke wrote: >> >> I do not think you can be more resource efficient. In the end you have to >> store the data anyway on HDFS. You have a lot of development effort for

Re: Sqoop on Spark

2016-04-05 Thread Jörn Franke
ng ingestion, if possible. Also, I can then use Spark stand alone cluster > to ingest, even if my hadoop cluster is heavily loaded. What you guys think? > >> On Wed, Apr 6, 2016 at 3:13 PM, Jörn Franke wrote: >> Why do you want to reimplement something which is already there?

Re: Sqoop on Spark

2016-04-05 Thread Jörn Franke
Why do you want to reimplement something which is already there? > On 06 Apr 2016, at 06:47, ayan guha wrote: > > Hi > > Thanks for reply. My use case is query ~40 tables from Oracle (using index > and incremental only) and add data to existing Hive tables. Also, it would be > good to have an

Re: Hive on Spark engine

2016-03-26 Thread Jörn Franke
If you check the newest Hortonworks distribution then you see that it generally works. Maybe you can borrow some of their packages. Alternatively it should be also available in other distributions. > On 26 Mar 2016, at 22:47, Mich Talebzadeh wrote: > > Hi, > > I am running Hive 2 and now Spar

Re: Forcing data from disk to memory

2016-03-25 Thread Jörn Franke
I am not 100% sure of the root cause, but if you need RDD caching then look at Apache Ignite or similar. > On 24 Mar 2016, at 16:22, Daniel Imberman wrote: > > Hi Takeshi, > > Thank you for getting back to me. If this is not possible then perhaps you > can help me with the root problem that c

Re: The build-in indexes in ORC file does not work.

2016-03-19 Thread Jörn Franke
How much data are you querying? What is the query? How selective is it supposed to be? What is the block size? > On 16 Mar 2016, at 11:23, Joseph wrote: > > Hi all, > > I have known that ORC provides three levels of indexes within each file: file > level, stripe level, and row level. > The fi

Re: The build-in indexes in ORC file does not work.

2016-03-18 Thread Jörn Franke
minal_type = 25080; > select * from gprs where terminal_type = 25080; > > In the gprs table, the "terminal_type" column's value is in [0, 25066] > > Joseph > > From: Jörn Franke > Date: 2016-03-16 19:26 > To: Joseph > CC: user; user > Subject: Re

Re: Hive Query on Spark fails with OOM

2016-03-14 Thread Jörn Franke
I am not sure about this. At least Hortonworks provides its distribution with Hive and Spark 1.6 > On 14 Mar 2016, at 09:25, Mich Talebzadeh wrote: > > I think the only version of Spark that works OK with Hive (Hive on Spark > engine) is version 1.3.1. I also get OOM from time to time and have
