Re: A naive ML question

2018-04-28 Thread Jörn Franke
What do you mean by “how it evolved over time”? A transaction basically describes an action at a certain point in time. Do you mean how a financial product evolved over time given a set of transactions? > On 28. Apr 2018, at 12:46, kant kodali wrote: > > Hi All, > > I

Re: is it ok to make I/O calls in UDF? In other words, is it a standard practice?

2018-04-23 Thread Jörn Franke
What is your use case? > On 23. Apr 2018, at 23:27, kant kodali wrote: > > Hi All, > > Is it ok to make I/O calls in UDF? other words is it a standard practice? > > Thanks! - To unsubscribe e-mail:

Re: Testing spark streaming action

2018-04-10 Thread Jörn Franke
Run it as part of integration testing; you can still use ScalaTest but with a different sub folder (it or integrationtest) instead of test. Within integrationtest you create a local Spark server that also has accumulators. > On 10. Apr 2018, at 17:35, Guillermo Ortiz

Re: spark application running in yarn client mode is slower than in local mode.

2018-04-09 Thread Jörn Franke
Probably network / shuffling cost? Or broadcast variables? Can you provide more details on what you do and some timings? > On 9. Apr 2018, at 07:07, Junfeng Chen wrote: > > I have wrote an spark streaming application reading kafka data and convert > the json data to parquet

Re: Does joining table in Spark multiplies selected columns of smaller table?

2018-04-08 Thread Jörn Franke
What do you mean by "the value is very large" in t2? How large? What is it? You could put the large data in separate files on HDFS and just maintain a file name in the table. > On 8. Apr 2018, at 19:52, Vitaliy Pisarev > wrote: > > I have two tables in spark: > >

Re: High Disk Usage In Spark 2.2.1 With No Shuffle Or Spill To Disk

2018-04-07 Thread Jörn Franke
As far as I know the TableSnapshotInputFormat relies on a temporary folder: https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableSnapshotInputFormat.html Unfortunately some input formats need a (local) tmp directory; sometimes this cannot be avoided. See also the source:

Re: [Spark 2.x Core] Adding to ArrayList inside rdd.foreach()

2018-04-07 Thread Jörn Franke
What are you trying to achieve? You should not use global variables in a Spark application. Especially not adding to a list - in most cases that makes no sense. If you want to put everything into one file then you should repartition to 1. > On 7. Apr 2018, at 19:07, klrmowse

Re: Best way to Hive to Spark migration

2018-04-04 Thread Jörn Franke
You need to provide more context on what you currently do in Hive and what you expect from the migration. > On 5. Apr 2018, at 05:43, Pralabh Kumar wrote: > > Hi Spark group > > What's the best way to Migrate Hive to Spark > > 1) Use HiveContext of Spark > 2) Use

Re: [SparkSQL] SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto

2018-03-28 Thread Jörn Franke
I don’t think select * is a good benchmark. You should do a more complex operation, otherwise optimizers might see that you don’t do anything in the query and immediately return (similarly, count might immediately return by using some statistics). > On 29. Mar 2018, at 02:03, Tin Vu

Re: DataFrames :: Corrupted Data

2018-03-28 Thread Jörn Franke
An encoding issue in the data? E.g. Spark uses UTF-8, but the source encoding is different? > On 28. Mar 2018, at 20:25, Sergey Zhemzhitsky wrote: > > Hello guys, > > I'm using Spark 2.2.0 and from time to time my job fails printing into > the log the following errors > >
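
If the source should turn out to be a text/CSV file in a non-UTF-8 encoding, the charset can be set explicitly when reading. A minimal Scala sketch, assuming a spark-shell session named spark and a hypothetical Latin-1 encoded file:

  val df = spark.read
    .option("header", "true")
    .option("encoding", "ISO-8859-1")   // "charset" is an alias in the CSV source
    .csv("hdfs:///data/legacy_latin1.csv")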

Re: [Spark R]: Linear Mixed-Effects Models in Spark R

2018-03-26 Thread Jörn Franke
SparkR does not mean that all R libraries are magically executed in a distributed fashion that scales with the data. In fact this is similar to much other analytical software: it offers the possibility to run things in parallel, but the libraries themselves are not using it. The reason is that it is

Re: Hive to Oracle using Spark - Type(Date) conversion issue

2018-03-20 Thread Jörn Franke
Write your own Spark UDF and apply it to all varchar columns. Within this UDF you can use the SimpleDateFormat parse method: if parsing fails you return the content as varchar, if it succeeds you return a date, and if the content is null you return null. Alternatively you can define an insert
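
A rough Scala sketch of such a UDF (not the code from the thread); the date pattern and column names are assumptions, and since a Spark column must have a single type, the sketch keeps the original varchar column and adds a parsed date column next to it:

  import java.text.SimpleDateFormat
  import org.apache.spark.sql.functions.{col, udf}

  // returns a java.sql.Date when the string parses, null otherwise
  val toDateOrNull = udf { (s: String) =>
    if (s == null) null
    else {
      val fmt = new SimpleDateFormat("yyyy-MM-dd") // assumed source pattern
      fmt.setLenient(false)
      try new java.sql.Date(fmt.parse(s).getTime) catch { case _: Exception => null }
    }
  }

  // hypothetical column names
  val withDates = df.withColumn("parsed_date", toDateOrNull(col("date_as_varchar")))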

Re: Run spark 2.2 on yarn as usual java application

2018-03-19 Thread Jörn Franke
Maybe you should rather run it in yarn-cluster mode; yarn-client mode would start the driver on the Oozie server. > On 19. Mar 2018, at 12:58, Serega Sheypak wrote: > > I'm trying to run it as Oozie java action and reduce env dependency. The only > thing I need is Hadoop

Re: EDI (Electronic Data Interchange) parser on Spark

2018-03-13 Thread Jörn Franke
doesn't > leverage Spark's parallel processing, which I want to do for large and huge > amount of EDI data. > > Any pointers on that? > > Thanks, > Aakash. > >> On Tue, Mar 13, 2018 at 3:44 PM, Jörn Franke <jornfra...@gmail.com> wrote: >> Maybe there a

Re: EDI (Electronic Data Interchange) parser on Spark

2018-03-13 Thread Jörn Franke
Maybe there are commercial ones. You could also use one of the open source parsers for XML. However, XML is very inefficient and you need to do a lot of tricks to make it run in parallel. This also depends on the type of EDI message etc. Sophisticated unit testing and performance testing is key.

Re: Spark scala development in Sbt vs Maven

2018-03-05 Thread Jörn Franke
I think most of the Scala development in Spark happens with sbt - in the open source world. However, you can do it with Gradle and Maven as well. It depends on your organization etc. what your standard is. Some things might be more cumbersome to reach in non-sbt Scala scenarios, but this is

Re: running Spark-JobServer in eclipse

2018-03-04 Thread Jörn Franke
I recommend running it with your unit tests, executed with your build tool. There is no need to have it running in the background in the IDE. > On 3. Mar 2018, at 17:57, sujeet jog wrote: > > Is there a way to run Spark-JobServer in eclipse ?.. any pointers in this >

Re: sqoop import job not working when spark thrift server is running.

2018-02-24 Thread Jörn Franke
The fair scheduler in YARN gives you the possibility to use more resources than configured if they are available. On 24. Feb 2018, at 13:47, akshay naidu wrote: >> it sure is not able to get sufficient resources from YARN to start the >> containers. > that's right. I

Re: parquet vs orc files

2018-02-22 Thread Jörn Franke
s, how does min/max index work? Can spark itself configure bloom filters > when saving as orc? > >> On Wed, Feb 21, 2018 at 1:40 PM, Jörn Franke <jornfra...@gmail.com> wrote: >> In the latest version both are equally well supported. >> >> You need to inse

Re: parquet vs orc files

2018-02-21 Thread Jörn Franke
In the latest version both are equally well supported. You need to insert the data sorted on the filtering columns; then you will benefit from min/max indexes and, in the case of ORC, additionally from bloom filters, if you configure them. In any case I recommend also partitioning of files (do not confuse
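
A minimal sketch of what this could look like with the ORC data source; the column, partition and path names are assumptions, and whether the bloom-filter option is honored depends on the Spark/ORC version (it is the standard ORC property passed through as a write option):

  df.sortWithinPartitions("customer_id")                 // sort on the filtering column
    .write
    .option("orc.bloom.filter.columns", "customer_id")   // configure a bloom filter on it
    .partitionBy("event_date")                           // partition the files
    .mode("overwrite")
    .orc("hdfs:///warehouse/events_orc")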

Re: Can spark handle this scenario?

2018-02-17 Thread Jörn Franke
You may want to think about separating the import step from the processing step. It is not very economical to download all the data again every time you want to calculate something. So download it first and store it on a distributed file system. Schedule the download of the newest information every

Re: Spark cannot find tables in Oracle database

2018-02-11 Thread Jörn Franke
Maybe you do not have access to the table/view. In case of a view it could also be that you do not have access to the underlying table. Have you tried accessing it with another SQL tool? > On 11. Feb 2018, at 03:26, Lian Jiang wrote: > > Hi, > > I am following >

Re: Log analysis with GraphX

2018-02-10 Thread Jörn Franke
What do you mean by path analysis and clicking trends? If you want to use typical graph algorithms such as longest path, shortest path (to detect issues with your navigation page) or PageRank then probably yes. Similarly if you do A/B testing to compare whether you sell more with different

Re: S3 token times out during data frame "write.csv"

2018-01-28 Thread Jörn Franke
He is using CSV; either ORC or Parquet would be fine. > On 28. Jan 2018, at 06:49, Gourav Sengupta wrote: > > Hi, > > There is definitely a parameter while creating temporary security credential > to mention the number of minutes those credentials will be active.

Re: S3 token times out during data frame "write.csv"

2018-01-23 Thread Jörn Franke
How large is the file? If it is very large then you should anyway have several partitions for the output. This is also important in case you need to read again from S3 - having several files there enables parallel reading. > On 23. Jan 2018, at 23:58, Vasyl Harasymiv
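
A minimal sketch of writing several CSV part files instead of a single one; the partition count and bucket path are assumptions:

  df.repartition(64)                      // several output files instead of one
    .write
    .option("header", "true")
    .mode("overwrite")
    .csv("s3a://my-bucket/exports/large_table_csv")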

Re: run spark job in yarn cluster mode as specified user

2018-01-22 Thread Jörn Franke
Configure Kerberos > On 22. Jan 2018, at 08:28, sd wang wrote: > > Hi Advisers, > When submit spark job in yarn cluster mode, the job will be executed by > "yarn" user. Any parameters can change the user? I tried setting > HADOOP_USER_NAME but it did not work. I'm

Re: Processing huge amount of data from paged API

2018-01-21 Thread Jörn Franke
Which device provides messages as thousands of HTTP pages? This is obviously inefficient and it will not help much to run them in parallel. Furthermore, with paging you risk that messages get lost or that you get duplicate messages. I still do not get why nowadays applications download a lot of data

Re: Reading Hive RCFiles?

2018-01-20 Thread Jörn Franke
Forgot to add the mailing list. > On 18. Jan 2018, at 18:55, Jörn Franke <jornfra...@gmail.com> wrote: > > Welll you can use: > https://spark.apache.org/docs/latest/api/java/org/apache/spark/SparkContext.html#hadoopRDD-org.apache.hadoop.mapred.JobConf-java.lang.Cla

Re: Saving each line of RDD as a separate file with key as the file name

2018-01-20 Thread Jörn Franke
Not sure if I understood exactly what you need, but you could have one partition per line. Alternatively you could use the MultipleOutputs format in Hadoop. > On 20. Jan 2018, at 22:56, pooja bhojwani wrote: > > Hi all, > > So, I have a Java Pair RDD with let’s say n

Re: Spark Streaming not reading missed data

2018-01-16 Thread Jörn Franke
It could be a missing persist before the checkpoint > On 16. Jan 2018, at 22:04, KhajaAsmath Mohammed > wrote: > > Hi, > > Spark streaming job from kafka is not picking the messages and is always > taking the latest offsets when streaming job is stopped for 2 hours.

Re: [Spark ML] Positive-Only Training Classification in Scala

2018-01-15 Thread Jörn Franke
I think you are rather looking for unsupervised learning algorithms, e.g. clustering. Depending on the characteristics, different clusters might be created, e.g. donor or non-donor. Most likely you may also find more clusters (e.g. would donate but has a disease preventing it, or is too old). You can verify

Re: 3rd party hadoop input formats for EDI formats

2018-01-15 Thread Jörn Franke
I do not want to advertise certain third party components. Hence, just some food for thought: Python Pandas supports some of those formats (it is not an InputFormat though). Some commercial offerings just provide ETL to convert it into another format already supported by Spark. Then

Re: [Spark SQL] How to run a custom meta query for `ANALYZE TABLE`

2018-01-02 Thread Jörn Franke
Hi, No, this is not possible with the current data source API. However, there is a new data source API v2 on its way - maybe it will support it. Alternatively, you can have a config option to calculate metadata after an insert. However, could you please explain more, for which DB your

Re: Spark Docker

2017-12-25 Thread Jörn Franke
You can find several presentations on this on the Spark Summit web page. Generally you also have to decide whether you run one cluster for all applications or one cluster per application in the container context. Not sure though why you want to run on just one node. If you have only one

Re: Reading data from OpenTSDB or KairosDB

2017-12-21 Thread Jörn Franke
There are data sources for Cassandra and HBase; however, I am not sure how useful they are, because then you also need to implement the logic of OpenTSDB or KairosDB. Better to implement your own data source. Then, there are several projects enabling time series queries in Spark, but I am not

Re: Help Required on Spark - Convert DataFrame to List with out using collect

2017-12-18 Thread Jörn Franke
This is correct behavior. If you need to call another method, simply append another map, flatMap or whatever you need. Depending on your use case you may also use reduce and reduceByKey. However, you should never (!) use a global variable as in your snippet. This cannot work because you work in
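
A small Scala sketch of the chained-transformation style (as opposed to mutating a shared list inside an action); the input path and column name are assumptions, and collect() at the end is only safe if the result fits on the driver:

  val df = spark.read.option("header", "true").csv("hdfs:///data/input.csv")  // assumed input

  import spark.implicits._
  val names = df
    .select($"name")        // one step
    .as[String]             // typed Dataset
    .map(_.trim)            // "append another map" as needed
    .filter(_.nonEmpty)

  // only if a local List is really required and small enough for the driver
  val nameList: List[String] = names.collect().toList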

Re: NASA CDF files in Spark

2017-12-16 Thread Jörn Franke
Develop your own Hadoop InputFormat and use https://spark.apache.org/docs/2.0.2/api/java/org/apache/spark/SparkContext.html#newAPIHadoopRDD(org.apache.hadoop.conf.Configuration,%20java.lang.Class,%20java.lang.Class,%20java.lang.Class) to load it. The Spark datasource API will be relevant for you in
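
A sketch of the loading side only, assuming such an InputFormat has been written; CdfInputFormat, CdfKey and CdfRecord are hypothetical class names a custom implementation would provide:

  import org.apache.hadoop.conf.Configuration

  val conf = new Configuration()
  conf.set("mapreduce.input.fileinputformat.inputdir", "hdfs:///data/cdf")  // assumed path

  // hypothetical classes: CdfInputFormat extends InputFormat[CdfKey, CdfRecord]
  val cdfRdd = sc.newAPIHadoopRDD(
    conf,
    classOf[CdfInputFormat],
    classOf[CdfKey],
    classOf[CdfRecord])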

Re: Spark loads data from HDFS or S3

2017-12-13 Thread Jörn Franke
S3 can be realized more cheaply than HDFS on Amazon. As you correctly describe, it does not support data locality; the data is distributed to the workers. Depending on your use case it can make sense to have HDFS as a temporary “cache” for S3 data. > On 13. Dec 2017, at 09:39, Philip Lee

Re: SparkSQL not support CharType

2017-11-23 Thread Jörn Franke
Or ByteType, depending on the use case. > On 23. Nov 2017, at 10:18, Herman van Hövell tot Westerflier > wrote: > > You need to use a StringType. The CharType and VarCharType are there to > ensure compatibility with Hive and ORC; they should not be used anywhere

Re: build spark source code

2017-11-22 Thread Jörn Franke
You can check whether Apache Bigtop provides something like this for Spark on Windows (well, probably not based on sbt but on mvn). > On 23. Nov 2017, at 03:34, Michael Artz wrote: > > It would be nice if I could download the source code of spark from github, > then build

Re: Spark based Data Warehouse

2017-11-12 Thread Jörn Franke
What do you mean by all possible workloads? You cannot prepare any system to do all possible processing. We do not know the requirements of your data scientists now or in the future, so it is difficult to say. How do they work currently without the new solution? Do they all work on the same data? I

Re: Anyone knows how to build and spark on jdk9?

2017-10-27 Thread Jörn Franke
Scala 2.12 is not yet supported on Spark - this also means no JDK 9: https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-14220 If you look at the Oracle support then JDK 9 is anyway only supported for 6 months. JDK 8 is LTS (5 years), JDK 18.3 will be only 6 months and JDK 18.9 is

Re: Why don't I see my spark jobs running in parallel in Cassandra/Spark DSE cluster?

2017-10-27 Thread Jörn Franke
See also https://spark.apache.org/docs/latest/job-scheduling.html > On 27. Oct 2017, at 08:05, Cassa L wrote: > > Hi, > I have a spark job that has use case as below: > RRD1 and RDD2 read from Cassandra tables. These two RDDs then do some > transformation and after that I

Re: Why don't I see my spark jobs running in parallel in Cassandra/Spark DSE cluster?

2017-10-27 Thread Jörn Franke
Do you use YARN? Then you need to configure the queues with the right scheduler and method. > On 27. Oct 2017, at 08:05, Cassa L wrote: > > Hi, > I have a spark job that has use case as below: > RRD1 and RDD2 read from Cassandra tables. These two RDDs then do some >

Re: text processing in spark (Spark job stucks for several minutes)

2017-10-26 Thread Jörn Franke
Please provide the source code and the exceptions that are in the executor and/or driver log. > On 26. Oct 2017, at 08:42, Donni Khan wrote: > > Hi, > I'm applying preprocessing methods on big data of text by using spark-Java. I > created my own NLP pipline as a normal java

Re: Orc predicate pushdown with Spark Sql

2017-10-24 Thread Jörn Franke
Well, the meta information is in the file, so I am not surprised that it reads the file; but it should not read all the content, which is probably also not happening. > On 24. Oct 2017, at 18:16, Siva Gudavalli > wrote: > > > Hello, > > I have an update

Re: Bulk load to HBase

2017-10-22 Thread Jörn Franke
Before you look at any new library/tool: what is the process of importing, what is the original file format, file size, compression etc.? Once you have investigated this you can start improving it. Then, as a last step, a new framework can be explored. Feel free to share those details and we can help you

Re: Is Spark suited for this use case?

2017-10-16 Thread Jörn Franke
Hi, What is the motivation behind your question? Saving costs? You seem to be happy with the functional/non-functional requirements, so the only thing it could be is cost or the need for innovation in the future. Best regards > On 16. Oct 2017, at 06:32, van den Heever, Christian CC >

Re: Near Real time analytics with Spark and tokenization

2017-10-15 Thread Jörn Franke
Can’t you cache the token vault in a caching solution, such as Ignite? The lookup of single tokens would be really fast. What volumes are we talking about? I assume you refer to PCI DSS, so security might be an important aspect which might not be that easy to achieve with vault-less

Re: Kafka 010 Spark 2.2.0 Streaming / Custom checkpoint strategy

2017-10-13 Thread Jörn Franke
HDFS can be replaced by other filesystem plugins (e.g. IgniteFS, S3, etc.), so the easiest is to write a file system plugin. This is not a plug-in for Spark but part of the Hadoop functionality used by Spark. > On 13. Oct 2017, at 17:41, Anand Chandrashekar wrote: > >

Re: Reading from HDFS by increasing split size

2017-10-10 Thread Jörn Franke
;> like conf to underlying hadoop config.essentially you should be able to >> control behaviour of split as you can do in a map-reduce program (as Spark >> uses the same input format) >> >>> On Tue, Oct 10, 2017 at 10:21 PM, Jörn Franke <jornfra...@gmail.com>
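
A rough sketch of the conf-based approach quoted in this entry: raising the minimum split size makes each task read a larger chunk of the file. The 512 MB value is an assumption, and whether the property is honored depends on the input format / Hadoop API in use:

  val targetSplit = 512L * 1024 * 1024   // assumed target split size
  sc.hadoopConfiguration.setLong("mapreduce.input.fileinputformat.split.minsize", targetSplit)

  val rdd = sc.textFile("hdfs_file_path")   // path name from the original question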

Re: Reading from HDFS by increasing split size

2017-10-10 Thread Jörn Franke
Write your own input format/datasource or split the file yourself beforehand (not recommended). > On 10. Oct 2017, at 09:14, Kanagha Kumar wrote: > > Hi, > > I'm trying to read a 60GB HDFS file using spark textFile("hdfs_file_path", > minPartitions). > > How can I

Re: [Spark-Submit] Where to store data files while running job in cluster mode?

2017-09-29 Thread Jörn Franke
You should use a distributed filesystem such as HDFS. If you want to use the local filesystem then you have to copy each file to each node. > On 29. Sep 2017, at 12:05, Gaurav1809 wrote: > > Hi All, > > I have multi node architecture of (1 master,2 workers) Spark

Re: More instances = slower Spark job

2017-09-28 Thread Jörn Franke
This looks a little bit strange to me. First, json.gz files are single-threaded, i.e. each file can only be processed by one thread (so it is good to have many files of around 128 MB to 512 MB each). Then, what you do in the code is already done by the data source. There is no need to read the

Re: Where can I get few GBs of sample data?

2017-09-28 Thread Jörn Franke
I think just any dataset is not useful. The data should be close to the real data that you want to process. Similarly, the processing should be the same as you plan to do. > On 28. Sep 2017, at 18:04, Gaurav1809 wrote: > > Hi All, > > I have setup multi node spark cluster

Re: Apache Spark - MLLib challenges

2017-09-23 Thread Jörn Franke
As far as I know there is currently no in-memory encryption in Spark. There are some research projects to create secure in-memory enclaves based on Intel SGX, but there is still a lot to do in terms of performance and security objectives. The more interesting question is why you would need this

Re: [SPARK-SQL] Does spark-sql have Authorization built in?

2017-09-16 Thread Jörn Franke
It depends on the permissions the user has on the local file system or HDFS, so there is no need to have grant/revoke. > On 15. Sep 2017, at 17:13, Arun Khetarpal wrote: > > Hi - > > Wanted to understand if spark sql has GRANT and REVOKE statements available? > Is

Re: UI for spark machine learning.

2017-08-22 Thread Jörn Franke
Is it really required to have one billion samples for just a linear regression? Probably your model would do equally well with far fewer samples. Have you checked bias and variance when you use far fewer random samples? > On 22. Aug 2017, at 12:58, Sea aj wrote: > > I have a

Re: Spark hive overwrite is very very slow

2017-08-20 Thread Jörn Franke
org/jira/browse/SPARK-20049 > > I saw something in the above link not sure if that is same thing in my case. > > Thanks, > Asmath > >> On Sun, Aug 20, 2017 at 11:42 AM, Jörn Franke <jornfra...@gmail.com> wrote: >> Have you made sure that the saveastable stores them as par

Re: Spark hive overwrite is very very slow

2017-08-20 Thread Jörn Franke
Have you made sure that saveAsTable stores them as Parquet? > On 20. Aug 2017, at 18:07, KhajaAsmath Mohammed <mdkhajaasm...@gmail.com> > wrote: > > we are using parquet tables, is it causing any performance issue? > >> On Sun, Aug 20, 2017 at 9:09 AM, Jörn F

Re: Spark hive overwrite is very very slow

2017-08-20 Thread Jörn Franke
20. Aug 2017, at 15:52, KhajaAsmath Mohammed <mdkhajaasm...@gmail.com> > wrote: > > Yes we tried hive and want to migrate to spark for better performance. I am > using paraquet tables . Still no better performance while loading. > > Sent from my iPhone > >&g

Re: Spark hive overwrite is very very slow

2017-08-20 Thread Jörn Franke
Have you tried directly in Hive how the performance is? In which format do you expect Hive to write? Have you made sure it is in this format? It could be that you use an inefficient format (e.g. CSV + bzip2). > On 20. Aug 2017, at 03:18, KhajaAsmath Mohammed > wrote:

Re: Logging in unit tests

2017-08-19 Thread Jörn Franke
Are you using Gradle or something similar for building? > On 19. Aug 2017, at 11:58, Pascal Stammer wrote: > > Hi all, > > I am writing unit tests for my spark application. In the rest of the project > I am using log4j2.xml files to configure logging. Now I am running in

Re: Reading Excel (.xlsm) file through PySpark 2.1.1 with external JAR is causing fatal conversion of data type

2017-08-18 Thread Jörn Franke
it to a datetime format, which is making >>> it this - >>> >>> >>> from pyspark.sql.functions import from_unixtime, unix_timestamp >>> >>> df2 = dflead.select('Enter_Date', >>> >>> from_unixtime(unix_timestamp('Enter_Date', 'MM/dd/yyy')

Re: Reading Excel (.xlsm) file through PySpark 2.1.1 with external JAR is causing fatal conversion of data type

2017-08-16 Thread Jörn Franke
You can use Apache POI's DateUtil to convert a double to a Date (https://poi.apache.org/apidocs/org/apache/poi/ss/usermodel/DateUtil.html). Alternatively you can try HadoopOffice (https://github.com/ZuInnoTe/hadoopoffice/wiki); it supports Spark 1.x as well as a Spark 2.0 data source. > On 16. Aug 2017, at 20:15,
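
A small Scala sketch of the DateUtil route (the thread itself is PySpark, but the POI call is the same); the column name is taken from the quoted snippet, POI must be on the classpath, and wrapping it in a UDF is an assumption for illustration:

  import org.apache.poi.ss.usermodel.DateUtil
  import org.apache.spark.sql.functions.{col, udf}

  // converts the Excel serial number (a double) to a java.sql.Date
  val excelSerialToDate = udf { (serial: Double) =>
    new java.sql.Date(DateUtil.getJavaDate(serial).getTime)
  }

  val fixed = df.withColumn("Enter_Date", excelSerialToDate(col("Enter_Date")))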

Re: [Spark Core] Is it possible to insert a function directly into the Logical Plan?

2017-08-14 Thread Jörn Franke
What about accumulators ? > On 14. Aug 2017, at 20:15, Lukas Bradley wrote: > > We have had issues with gathering status on long running jobs. We have > attempted to draw parallels between the Spark UI/Monitoring API and our code > base. Due to the separation between

Re: XML Parsing with Spark and SCala

2017-08-11 Thread Jörn Franke
Can you specify what "is not able to load" means and what are the expected results? > On 11. Aug 2017, at 09:30, Etisha Jain wrote: > > Hi > > I want to do xml parsing with spark, but the data from the file is not able > to load and the desired output is also not

Re: Multiple queries on same stream

2017-08-09 Thread Jörn Franke
This is not easy to say without testing. It depends on the type of computation etc.; it also depends on the Spark version. Generally, vectorization / SIMD could be much faster if it is applied by Spark / the JVM in scenario 2. > On 9. Aug 2017, at 07:05, Raghavendra Pandey

Re: DataSet creation not working Spark 1.6.0 , populating wrong data CDH 5.7.1

2017-08-03 Thread Jörn Franke
You need to create a schema for person. https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema > On 3. Aug 2017, at 12:09, Rabin Banerjee wrote: > > Hi All, > > I am trying to create a DataSet from DataFrame, where
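A minimal sketch of the linked "programmatically specifying the schema" approach; the Person fields and the source dataframe are assumptions (on Spark 1.6 use sqlContext.createDataFrame instead of spark):

  import org.apache.spark.sql.Row
  import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

  val personSchema = StructType(Seq(
    StructField("name", StringType, nullable = true),
    StructField("age",  IntegerType, nullable = true)))

  // map the existing rows onto the schema, then build the DataFrame from it
  val rowRdd  = df.rdd.map(r => Row(r.getString(0), r.getInt(1)))
  val persons = spark.createDataFrame(rowRdd, personSchema)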

Re: Runnig multiple spark jobs on yarn

2017-08-02 Thread Jörn Franke
And if the YARN queues are configured accordingly. > On 2. Aug 2017, at 16:47, ayan guha wrote: > > Each of your spark-submit will create separate applications in YARN and run > concurrently (if you have enough resource, that is) > >> On Thu, Aug 3, 2017 at 12:42 AM, serkan

Re: Quick one on evaluation

2017-08-02 Thread Jörn Franke
I assume printSchema would not trigger an evaluation. Show might partially trigger an evaluation (not all data is shown, only a certain number of rows by default). Keep in mind that even a count might not trigger evaluation of all rows (especially in the future) due to updates of the optimizer.

Re: Support Dynamic Partition Inserts params with SET command in Spark 2.0.1

2017-07-28 Thread Jörn Franke
Try sparkSession.conf().set. > On 28. Jul 2017, at 12:19, Chetan Khatri wrote: > > Hey Dev/ USer, > > I am working with Spark 2.0.1 and with dynamic partitioning with Hive facing > below issue: > > org.apache.hadoop.hive.ql.metadata.HiveException: > Number of
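
A sketch of what that could look like on the session; the exact keys and values are assumptions and depend on the error being hit:

  spark.conf.set("hive.exec.dynamic.partition", "true")
  spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")
  spark.conf.set("hive.exec.max.dynamic.partitions", "2000")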

Re: real world spark code

2017-07-25 Thread Jörn Franke
that? > > Adaryl "Bob" Wakefield, MBA > Principal > Mass Street Analytics, LLC > 913.938.6685 > www.massstreet.net > www.linkedin.com/in/bobwakefieldmba > Twitter: @BobLovesData > > > From: Jörn Franke [mailto:jornfra...@gmail.com] > Sent: Tues

Re: real world spark code

2017-07-25 Thread Jörn Franke
Look for the ones that have unit and integration tests as well as CI + reporting on code quality. All the others are just toy examples. Well, they should be :) > On 25. Jul 2017, at 01:08, Adaryl Wakefield > wrote: > > Anybody know of publicly available GitHub repos

Re: using Kudu with Spark

2017-07-24 Thread Jörn Franke
I guess you have to find out yourself with experiments. Cloudera has some benchmarks, but it always depends on what you test, your data volume and what is meant by "fast". It is also more than a file format - servers that communicate with each other etc. - more complexity. Of course there are

Re: custom joins on dataframe

2017-07-24 Thread Jörn Franke
It might be faster if you add a column with the hash result to the dataframe before the join and then simply do a normal join on that column. > On 22. Jul 2017, at 17:39, Stephen Fletcher > wrote: > > Normally a family of joins (left, right outter, inner) are
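
A small sketch of that suggestion; the dataframe and key column names are assumptions, hash() is Spark's built-in hash function, and hash collisions would still need handling if exact equality on the original keys matters:

  import org.apache.spark.sql.functions.{col, hash}

  val leftH  = left.withColumn("join_hash", hash(col("key1"), col("key2")))
  val rightH = right.withColumn("join_hash", hash(col("key1"), col("key2")))

  val joined = leftH.join(rightH, Seq("join_hash"))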

Re: how does spark handle compressed files

2017-07-19 Thread Jörn Franke
Spark uses the Hadoop API to access files. This means they are transparently decompressed. However, gzip can only be decompressed in a single thread per file, and bzip2 is very slow. The best is either to have multiple files (each one at least the size of an HDFS block) or, better, to use a modern

Re: VS: Using Spark as a simulator

2017-07-07 Thread Jörn Franke
t; From: Mahesh Sawaiker <mahesh_sawai...@persistent.com> > Sent: 21 June 2017 14:45 > To: Esa Heikkinen; Jörn Franke > Cc: user@spark.apache.org > Subject: RE: Using Spark as a simulator > > Spark can help you to create one large file if needed, bu

Re: PySpark working with Generators

2017-06-30 Thread Jörn Franke
In this case I do not see many benefits in using Spark. Is the data volume high? Alternatively, I recommend converting the proprietary format into a format Spark understands and then using this format in Spark. Another alternative would be to write a custom Spark data source. Even your

Re: "Sharing" dataframes...

2017-06-20 Thread Jörn Franke
You could express it all in one program; alternatively use the Ignite in-memory file system or the Ignite shared RDD (not sure if DataFrames are supported). > On 20. Jun 2017, at 19:46, Jean Georges Perrin wrote: > > Hey, > > Here is my need: program A does something on a set of data and

Re: Using Spark as a simulator

2017-06-20 Thread Jörn Franke
It is fine, but you have to design it so that generated rows are written in large blocks for optimal performance. The trickiest part of data generation is the conceptual part, such as probabilistic distributions etc. You also have to check that you use a good random generator; for some cases

Re: fetching and joining data from two different clusters

2017-06-18 Thread Jörn Franke

Re: fetching and joining data from two different clusters

2017-06-15 Thread Jörn Franke

Re: fetching and joining data from two different clusters

2017-06-15 Thread Jörn Franke
>> On 15 June 2017 at 17:05, Jörn Franke <jornfra...@gmail.com> wrote: >> It does not matter to Spark you just put the HDFS

Re: fetching and joining data from two different clusters

2017-06-15 Thread Jörn Franke
It does not matter to Spark; you just put the HDFS URL of the namenode there. Of course the issue is that you lose data locality, but this would also be the case for Oracle. > On 15. Jun 2017, at 18:03, Mich Talebzadeh wrote: > > Hi, > > With Spark how easy is it

Re: Spark Streaming Design Suggestion

2017-06-13 Thread Jörn Franke
I do not fully understand the design here. Why not send everything to one topic with some application id in the message, and write to one output topic also indicating the application id? Can you elaborate a little bit more on the use case? Especially applications deleting/creating topics dynamically can

Re: [Spark JDBC] Does spark support read from remote Hive server via JDBC

2017-06-11 Thread Jörn Franke
Is Sentry preventing the access? > On 11. Jun 2017, at 01:55, vaquar khan wrote: > > Hi , > Pleaae check your firewall security setting sharing link one good link. > > http://belablotski.blogspot.in/2016/01/access-hive-tables-from-spark-using.html?m=1 > > > > Regards,

Re: [CSV] If number of columns of one row bigger than maxcolumns it stop the whole parsing process.

2017-06-08 Thread Jörn Franke
On Thu, Jun 8, 2017 at 12:11 PM Jörn Franke <jornfra...@gmail.com> wrote: >> The CSV data source allows you to skip invalid lines - this should also >> include lines that have more than maxColumns. Choose mode "DROPMALFORMED" >> >>> On 8. Jun 2017, a

Re: Scala, Python or Java for Spark programming

2017-06-08 Thread Jörn Franke
(we try to avoid excessive use of tuples, use named >>> functions, etc.) Given these constraints, I find Scala to be very >>> readable, and far easier to use than Java. The Lambda functionality of >>> Java provides a lot of similar features, but the amount of typing r

Re: [CSV] If number of columns of one row bigger than maxcolumns it stop the whole parsing process.

2017-06-07 Thread Jörn Franke
The CSV data source allows you to skip invalid lines - this should also include lines that have more than maxColumns. Choose mode "DROPMALFORMED" > On 8. Jun 2017, at 03:04, Chanh Le <giaosu...@gmail.com> wrote: > > Hi Takeshi, Jörn Franke, > > The problem is
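
A minimal sketch of that suggestion; the path is an assumption, and whether DROPMALFORMED actually catches rows exceeding maxColumns depends on the Spark/univocity version (the follow-up in this thread suggests it may not, in which case raising maxColumns is the other knob):

  val df = spark.read
    .option("header", "true")
    .option("mode", "DROPMALFORMED")   // skip rows the parser flags as malformed
    .option("maxColumns", "50000")     // optionally raise the parser's column limit
    .csv("hdfs:///data/wide_input.csv")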

Re: Scala, Python or Java for Spark programming

2017-06-07 Thread Jörn Franke
I think this is a religious question ;-) Java is often underestimated, because people are not aware of its lambda functionality, which makes the code very readable. Scala - it depends on who programs it. People coming from a normal Java background write Java-like code in Scala, which might not be

Re: [CSV] If number of columns of one row bigger than maxcolumns it stop the whole parsing process.

2017-06-07 Thread Jörn Franke
Spark CSV data source should be able > On 7. Jun 2017, at 17:50, Chanh Le wrote: > > Hi everyone, > I am using Spark 2.1.1 to read csv files and convert to avro files. > One problem that I am facing is if one row of csv file has more columns than > maxColumns (default is

Re: Performance issue when running Spark-1.6.1 in yarn-client mode with Hadoop 2.6.0

2017-06-06 Thread Jörn Franke
What does your Spark job do? Have you tried standard configurations and changing them gradually? Have you checked in the log files/UI which tasks take long? 17 million records does not sound like much, but it depends on what you do with them. I do not think that for such a small "cluster" it makes sense to

Re: Java SPI jar reload in Spark

2017-06-06 Thread Jörn Franke
Why do you need jar reloading? What functionality is executed during jar reloading? Maybe there is another way to achieve the same without jar reloading. In fact, it might be dangerous from a functional point of view - if the functionality in the jar changed, all your computation is wrong. > On 6. Jun

Re: An Architecture question on the use of virtualised clusters

2017-06-01 Thread Jörn Franke
Hi, I have done this (not Isilon, but another storage system). It can be efficient for small clusters, depending on how you design the network. What I have also seen is the microservice approach with object stores (e.g. in the cloud S3, on premise Swift), which is somehow also similar. If

Re: Dynamically working out upperbound in JDBC connection to Oracle DB

2017-05-29 Thread Jörn Franke
I think you need to remove the hyphen around maxid > On 29. May 2017, at 18:11, Mich Talebzadeh wrote: > > Hi, > > This JDBC connection works with Oracle table with primary key ID > > val s = HiveContext.read.format("jdbc").options( > Map("url" -> _ORACLEserver, >

Re: Spark checkpoint - nonstreaming

2017-05-26 Thread Jörn Franke
Just load it as you would from any other directory. > On 26. May 2017, at 17:26, Priya PM <pmpr...@gmail.com> wrote: > > > -- Forwarded message -- > From: Priya PM <pmpr...@gmail.com> > Date: Fri, May 26, 2017 at 8:54 PM > Subject: Re: Spark checkpoint

Re: Spark checkpoint - nonstreaming

2017-05-26 Thread Jörn Franke
Do you have some source code? Did you set the checkpoint directory? > On 26. May 2017, at 16:06, Priya wrote: > > Hi, > > With nonstreaming spark application, did checkpoint the RDD and I could see > the RDD getting checkpointed. I have killed the application after >

Re: How to see the full contents of dataset or dataframe is structured streaming?

2017-05-18 Thread Jörn Franke
You can also write it into a file and view it using your favorite viewer/editor > On 18. May 2017, at 04:55, kant kodali wrote: > > Hi All, > > How to see the full contents of dataset or dataframe is structured streaming > just like we normally with df.show(false)? Is

Re: spark cluster performance decreases by adding more nodes

2017-05-17 Thread Jörn Franke
The issue might be the group by, which under certain circumstances can cause a lot of traffic to one node. This transfer of course goes away the fewer nodes you have. Have you checked in the UI what it reports? > On 17. May 2017, at 17:13, Junaid Nasir wrote: > > I have a large
