Re: CROSSVALIDATION and hypothetical fail

2017-05-12 Thread Jörn Franke
Use several jobs and orchestrate them, e.g. via Oozie. These jobs can then save intermediate results to disk and load them from there. Alternatively (or additionally!) you may use persist (to memory and disk), but I am not sure this is suitable for such long-running applications. > On 12. May
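
A minimal sketch of that pattern (persist to memory and disk, and additionally write the intermediate result out so a separately orchestrated job can pick it up); paths and names are invented for illustration:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    object IntermediateResults {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("cv-stage-1").getOrCreate()

        // Stage 1: compute an intermediate result and keep it on memory and disk
        val features = spark.read.parquet("/data/input")
          .filter("label IS NOT NULL")
          .persist(StorageLevel.MEMORY_AND_DISK)

        // Also write it to disk so the next job in the workflow can load it from there
        features.write.mode("overwrite").parquet("/data/intermediate/features")

        // Stage 2 could run here or, more robustly, as a separate orchestrated job
        features.groupBy("label").count()
          .write.mode("overwrite").parquet("/data/intermediate/label_counts")

        spark.stop()
      }
    }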

Re: parquet optimal file structure - flat vs nested

2017-04-30 Thread Jörn Franke
"city": "Taipei", >>> "localName": "Taoyuan Intl.", >>> "airportCityState": "Taipei, Taiwan" >>> &g

Re: parquet optimal file structure - flat vs nested

2017-04-30 Thread Jörn Franke
city": "Taipei", >>> "localName": "Taoyuan Intl.", >>> "airportCityState": "Taipei, Taiwan" >>> >>> >>>

Re: parquet optimal file structure - flat vs nested

2017-04-30 Thread Jörn Franke
Depends on your queries, the data structure etc. Generally flat is better, but if your query filter is on the highest level then you may have better performance with a nested structure, but it really depends > On 30. Apr 2017, at 10:19, Zeming Yu wrote: > > Hi, > > We're
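
A rough illustration of the flat-versus-nested trade-off, with schemas invented from the quoted JSON fields; which layout performs better still depends on the actual queries:

    import org.apache.spark.sql.SparkSession

    // Hypothetical schemas for illustration only
    case class AirportFlat(city: String, localName: String, airportCityState: String)
    case class Location(city: String, localName: String)
    case class AirportNested(location: Location, airportCityState: String)

    object FlatVsNested {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("flat-vs-nested").getOrCreate()
        import spark.implicits._

        Seq(AirportFlat("Taipei", "Taoyuan Intl.", "Taipei, Taiwan")).toDS()
          .write.mode("overwrite").parquet("/tmp/airports_flat")
        Seq(AirportNested(Location("Taipei", "Taoyuan Intl."), "Taipei, Taiwan")).toDS()
          .write.mode("overwrite").parquet("/tmp/airports_nested")

        // Filter on a top-level column of the flat layout ...
        spark.read.parquet("/tmp/airports_flat").filter($"city" === "Taipei").show()
        // ... versus reaching into the nested struct
        spark.read.parquet("/tmp/airports_nested").filter($"location.city" === "Taipei").show()

        spark.stop()
      }
    }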

Re: Securing Spark Job on Cluster

2017-04-28 Thread Jörn Franke
is spilling temp file, shuffle data and > application data ? > > Thanks > Shashi > >> On Fri, Apr 28, 2017 at 3:54 PM, Jörn Franke <jornfra...@gmail.com> wrote: >> You can use disk encryption as provided by the operating system. >> Additionally, you may thi

Re: Securing Spark Job on Cluster

2017-04-28 Thread Jörn Franke
You can use disk encryption as provided by the operating system. Additionally, you may think about shredding disks after they are not used anymore. > On 28. Apr 2017, at 14:45, Shashi Vishwakarma > wrote: > > Hi All > > I was dealing with one the spark requirement

Re: Arraylist is empty after JavaRDD.foreach

2017-04-24 Thread Jörn Franke
I am not sure what you try to achieve here. You should never use the arraylist as you use it here as a global variable (an anti-pattern). Why don't you use the count function of the dataframe? > On 24. Apr 2017, at 19:36, Devender Yadav > wrote: > > Hi All, > >
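
A small sketch of the point being made (input path and schema are made up): collecting into a driver-side list from inside foreach does not work because the closure runs on the executors, whereas count or an explicit, bounded collect does.

    import org.apache.spark.sql.SparkSession

    object CountInsteadOfForeach {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("count-example").getOrCreate()
        val df = spark.read.json("/tmp/records.json") // hypothetical input

        // Anti-pattern: a driver-side list mutated inside foreach stays empty,
        // because the lambda executes on the executors, not on the driver.
        // val rows = new scala.collection.mutable.ArrayBuffer[String]()
        // df.toJSON.foreach(s => rows += s)

        // Use the built-in aggregation, or collect a bounded amount explicitly:
        val n = df.count()
        val sample = df.limit(100).collect()

        println(s"count = $n, sample size = ${sample.length}")
        spark.stop()
      }
    }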

Re: splitting a huge file

2017-04-21 Thread Jörn Franke
What is your DWH technology? If the file is on HDFS then, depending on the format, Spark can read parts of it in parallel. > On 21. Apr 2017, at 20:36, Paul Tremblay wrote: > > We are tasked with loading a big file (possibly 2TB) into a data warehouse. > In order to

Re: Re: How to store 10M records in HDFS to speed up further filtering?

2017-04-17 Thread Jörn Franke
To my best knowledge, HBase works best for records around hundreds of KB and > it requires extra work of the cluster administrator. So this would be the > last option. > > > Thanks! > > > > Mo Tao > > From: Jörn Franke <jornfra...@gmail.com> > Sent: 2

Re: Shall I use Apache Zeppelin for data analytics & visualization?

2017-04-17 Thread Jörn Franke
Please note that all processing will be done in Spark here. Please share your > thoughts. Thanks again. > >> On Mon, Apr 17, 2017 at 12:58 PM, Jörn Franke <jornfra...@gmail.com> wrote: >> I think it highly depends on your requirements. There are various tools for >>

Re: How to store 10M records in HDFS to speed up further filtering?

2017-04-17 Thread Jörn Franke
You need to sort the data by id, otherwise a situation can occur where the index does not work. Aside from this, it sounds odd to put a 5 MB column into those formats; this will also not be very efficient. What is in the 5 MB binary data? You could use HAR or maybe HBase to store this kind of

Re: Shall I use Apache Zeppelin for data analytics & visualization?

2017-04-17 Thread Jörn Franke
I think it highly depends on your requirements. There are various tools for analyzing and visualizing data. How many concurrent users do you have? What analysis do they do? How much data is involved? Do they have to process the data all the time or can they live with sampling which increases

Re: unit testing in spark

2017-04-10 Thread Jörn Franke
I think in the end you need to check the coverage of your application. If your application is well covered on the job or pipeline level (depends however on how you implement these tests) then it can be fine. In the end it really depends on the data and what kind of transformation you

Re: Why dataframe can be more efficient than dataset?

2017-04-08 Thread Jörn Franke
As far as I am aware in newer Spark versions a DataFrame is the same as Dataset[Row]. In fact, performance depends on so many factors, so I am not sure such a comparison makes sense. > On 8. Apr 2017, at 20:15, Shiyuan wrote: > > Hi Spark-users, > I came across a few
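
Minimal illustration of that equivalence in Spark 2.x, where DataFrame is declared as a type alias for Dataset[Row] in the org.apache.spark.sql package:

    import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

    object DataFrameIsDatasetOfRow {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("df-vs-ds").getOrCreate()
        import spark.implicits._

        val df: DataFrame = Seq((1, "a"), (2, "b")).toDF("id", "name")
        val untyped: Dataset[Row] = df                            // compiles without any conversion
        val typed: Dataset[(Int, String)] = df.as[(Int, String)]  // typed view via an encoder

        println(s"${untyped.count()} rows, ${typed.filter(_._1 > 1).count()} after a typed filter")
        spark.stop()
      }
    }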

Re: Does Spark uses its own HDFS client?

2017-04-07 Thread Jörn Franke
Maybe using ranger or sentry would be the better choice to intercept those calls? > On 7. Apr 2017, at 16:32, Alvaro Brandon wrote: > > I was going through the SparkContext.textFile() and I was wondering at that > point does Spark communicates with HDFS. Since when

Re: reading snappy eventlog files from hdfs using spark

2017-04-07 Thread Jörn Franke
How do you read them? > On 7. Apr 2017, at 12:11, Jacek Laskowski wrote: > > Hi, > > If your Spark app uses snappy in the code, define an appropriate library > dependency to have it on classpath. Don't rely on transitive dependencies. > > Jacek > > On 7 Apr 2017 8:34

Re: is there a way to persist the lineages generated by spark?

2017-04-06 Thread Jörn Franke
I do think this is the right way: you will have to do testing with test data, verifying that the output of the calculation matches the expected output. Even if the logical plan is correct, your calculation might not be. E.g. there can be bugs in Spark, in the UI or (which is very often the case) in the client

Re: Error while reading the CSV

2017-04-06 Thread Jörn Franke
And which version does your Spark cluster use? > On 6. Apr 2017, at 16:11, nayan sharma <nayansharm...@gmail.com> wrote: > > scalaVersion := “2.10.5" > > > > >> On 06-Apr-2017, at 7:35 PM, Jörn Franke <jornfra...@gmail.com> wrote: >> &

Re: Error while reading the CSV

2017-04-06 Thread Jörn Franke
pration-assembly-1.0.jar | grep csv >> >> after doing this I have found a lot of classes under >> com/databricks/spark/csv/ >> >> do I need to check for any specific class ?? >> >> Regards, >> Nayan >>> On 06-Apr-2017, at 6:42 PM, Jörn Franke &

Re: Error while reading the CSV

2017-04-06 Thread Jörn Franke
Is the library in your assembly jar? > On 6. Apr 2017, at 15:06, nayan sharma wrote: > > Hi All, > I am getting error while loading CSV file. > > val > datacsv=sqlContext.read.format("com.databricks.spark.csv").option("header", > "true").load("timeline.csv") >

Re: Update DF record with delta data in spark

2017-04-02 Thread Jörn Franke
If you trust that your delta file is correct then this might be the way forward. You just have to keep in mind that sometimes you can have several delta files in parallel and you need to apply them in the correct order, or otherwise a deleted row might reappear again. Things get more messy if a

Re: Partitioning strategy

2017-04-02 Thread Jörn Franke
You can always repartition, but maybe for your use case different rdds with the same data, but different partition strategies could make sense. It may also make sense to choose an appropriate format on disc (orc, parquet). You have to choose based also on the users' non-functional requirements.

Re: Upgrade the scala code using the most updated Spark version

2017-03-28 Thread Jörn Franke
$BUILD_SBT_FILE << ! >>> > lazy val root = (project in file(".")). >>> > settings( >>> > name := "${APPLICATION}", >>> > version := "1.0", >>> > scalaVersion := "2.11.8", &

Re: Upgrade the scala code using the most updated Spark version

2017-03-27 Thread Jörn Franke
Usually you define the dependencies to the Spark library as provided. You also seem to mix different Spark versions which should be avoided. The Hadoop library seems to be outdated and should also only be provided. The other dependencies you could assemble in a fat jar. > On 27 Mar 2017, at
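
A build.sbt sketch of that advice (versions are examples only): mark Spark and Hadoop as "provided" so they are not bundled into the fat jar, and use a single Spark version for all Spark modules.

    // build.sbt (sketch)
    name := "my-spark-app"
    version := "1.0"
    scalaVersion := "2.11.8"

    val sparkVersion = "2.1.0" // one Spark version for every Spark module

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
      "org.apache.spark" %% "spark-sql"  % sparkVersion % "provided",
      "org.apache.hadoop" % "hadoop-client" % "2.7.3" % "provided",
      // only non-Spark, non-Hadoop libraries end up in the assembly (fat jar)
      "com.typesafe" % "config" % "1.3.1"
    )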

Re: Persist RDD doubt

2017-03-23 Thread Jörn Franke
What do you mean by clear ? What is the use case? > On 23 Mar 2017, at 10:16, nayan sharma wrote: > > Does Spark clears the persisted RDD in case if the task fails ? > > Regards, > > Nayan

Re: Custom Spark data source in Java

2017-03-22 Thread Jörn Franke
DO Auto-generated method stub > return null; > } > > } > > which fails too... > > java.lang.NullPointerException > at org.apache.spark.sql.execution.datasources.LogicalRelation.( > LogicalRelation.scala:40) > at org.apache.spark.sql.SparkSession.baseRelationToDataFrame( > SparkSession.sc

Re: Custom Spark data source in Java

2017-03-22 Thread Jörn Franke
I think you can develop a Spark data source in Java, but you are right, most use Scala for the glue even if they have a Java library (this is what I did for the project I open sourced). Coming back to your question, it is a little bit difficult to assess the exact issue without the code. You

Re: [Spark CSV]: Use Custom TextInputFormat to Prevent Exceptions

2017-03-15 Thread Jörn Franke
Hi, The Spark CSV parser has different parsing modes: * permissive (default) tries to read everything; missing tokens are interpreted as null and extra tokens are ignored * dropmalformed drops lines which have more or fewer tokens than expected * failfast - RuntimeException if there is a malformed line
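
The three modes map directly to an option on the reader; a short sketch against a hypothetical input file, shown with the built-in Spark 2.x CSV reader, which accepts the same mode values:

    import org.apache.spark.sql.SparkSession

    object CsvParseModes {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("csv-modes").getOrCreate()

        def read(mode: String) = spark.read
          .option("header", "true")
          .option("mode", mode)
          .csv("/tmp/input.csv") // hypothetical path

        val permissive = read("PERMISSIVE")     // malformed tokens become null, extra tokens ignored
        val dropped    = read("DROPMALFORMED")  // malformed lines are skipped
        val strict     = read("FAILFAST")       // first malformed line throws an exception

        println(s"${permissive.count()} / ${dropped.count()} / ${strict.count()}")
        spark.stop()
      }
    }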

Re: Spark and continuous integration

2017-03-14 Thread Jörn Franke
this with ease, I was just wondering >> what people are using. >> >> Jenkins seems to have the best spark plugins so we are investigating that as >> well as a variety of other hosted CI tools >> >> Happy to write a blog post detailing our findings and sharing i

Re: Spark and continuous integration

2017-03-13 Thread Jörn Franke
Hi, Jenkins also now supports pipeline as code and multibranch pipelines, thus you are not so dependent on the UI and you no longer need a long list of jobs for different branches. Additionally it has a new UI (beta) called Blue Ocean, which is a little bit nicer. You may also check GoCD.

Re: Which streaming platform is best? Kafka or Spark Streaming?

2017-03-09 Thread Jörn Franke
I find this question strange. There is no best tool for every use case. For example, both tools mentioned below are suitable for different purposes, sometimes also complementary. > On 9 Mar 2017, at 20:37, Gaurav1809 wrote: > > Hi All, Would you please let me know

Re: Apparent memory leak involving count

2017-03-09 Thread Jörn Franke
You seem to always generate a new RDD instead of reusing the existing one, so it does not seem surprising that the memory need is growing. > On 9 Mar 2017, at 15:24, Facundo Domínguez wrote: > > Hello, > > Some heap profiling shows that memory grows under a TaskMetrics

Re: How to unit test spark streaming?

2017-03-07 Thread Jörn Franke
This depends on your target setup! For my open source libraries I run, for example, Spark integration tests (a dedicated folder alongside the unit tests) against a local Spark master, but also use a MiniDFSCluster (to simulate HDFS on a node) and sometimes also a MiniYARNCluster (see

Re: Spark JDBC reads

2017-03-07 Thread Jörn Franke
Can you provide some source code? I am not sure I understood the problem . If you want to do a preprocessing at the JDBC datasource then you can write your own data source. Additionally you may want to modify the sql statement to extract the data in the right format and push some preprocessing
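
Pushing the preprocessing into the SQL statement can be done by handing a subquery to the JDBC source; connection details and column names below are placeholders.

    import org.apache.spark.sql.SparkSession

    object JdbcPushdown {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("jdbc-pushdown").getOrCreate()

        // The database evaluates this subquery, so filtering/formatting happens there
        val query = """(SELECT id, CAST(amount AS DECIMAL(10,2)) AS amount
                        FROM sales WHERE sale_date >= DATE '2017-01-01') AS t"""

        val df = spark.read.format("jdbc")
          .option("url", "jdbc:postgresql://dbhost:5432/mydb")
          .option("dbtable", query)
          .option("user", "reader")
          .option("password", "secret")
          .option("partitionColumn", "id")  // parallel reads, split on a numeric column
          .option("lowerBound", "1")
          .option("upperBound", "1000000")
          .option("numPartitions", "8")
          .load()

        df.show()
        spark.stop()
      }
    }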

Re: Spark Beginner: Correct approach for use case

2017-03-05 Thread Jörn Franke
I agree with the others that a dedicated NoSQL datastore can make sense. You should look at the lambda architecture paradigm. Keep in mind that more memory does not necessarily mean more performance. It is the right data structure for the queries of your users. Additionally, if your queries

Re: kafka and zookeeper set up in prod for spark streaming

2017-03-03 Thread Jörn Franke
I think this highly depends on the risk that you want to be exposed to. If you have it on dedicated nodes there is less influence of other processes. I have seen both: on Hadoop nodes or dedicated. On Hadoop I would not recommend to put it on data nodes/heavily utilized nodes. Zookeeper does

Re: using spark to load a data warehouse in real time

2017-02-28 Thread Jörn Franke
I am not sure that Spark Streaming is what you want to do. It is for streaming analytics, not for loading into a DWH. You also need to define what realtime means and what is needed there - it will differ from client to client significantly. From my experience, just SQL is not enough for the users

Re: Run spark machine learning example on Yarn failed

2017-02-28 Thread Jörn Franke
You do not need to place it in every local directory of every node. Just use hadoop fs -put to put it on HDFS. Alternatively as others suggested use s3 > On 28 Feb 2017, at 02:18, Yunjie Ji wrote: > > After start the dfs, yarn and spark, I run these code under the root >

Re: extracting eventlogs saved snappy format.

2017-02-15 Thread Jörn Franke
What do you want to do with the event log ? The Hadoop command line can show compressed files (hadoop fs -text). Alternatively there are tools depending on your os ... you can also write a small job to do this and run it on the cluster. > On 15 Feb 2017, at 10:55, satishl

Re: My spark job runs faster in spark 1.6 and much slower in spark 2.0

2017-02-14 Thread Jörn Franke
Can you check in the UI which tasks took most of the time? Even the 45 min looks a little bit much given that you only work most of the time with 50k rows > On 15 Feb 2017, at 00:03, Timur Shenkao wrote: > > Hello, > I'm not sure that's your reason but check this

Re: fault tolerant dataframe write with overwrite

2017-02-14 Thread Jörn Franke
> successful writing of the new one. > Thanks, > Assaf. > > > From: Steve Loughran [mailto:ste...@hortonworks.com] > Sent: Tuesday, February 14, 2017 3:25 PM > To: Mendelson, Assaf > Cc: Jörn Franke; user > Subject: Re: fault tolerant dataframe write with overwrite

Re: fault tolerant dataframe write with overwrite

2017-02-14 Thread Jörn Franke
Normally you can fetch the filesystem interface from the configuration ( I assume you mean URI). Managing to get the last iteration: I do not understand the issue. You can have as the directory the current timestamp and at the end you simply select the directory with the highest number.
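
A sketch of both points (getting the FileSystem from the configuration and selecting the newest timestamped output directory); the directory layout is an assumption for illustration.

    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.sql.SparkSession

    object LatestOutputDir {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("latest-dir").getOrCreate()

        // FileSystem interface fetched from the Hadoop configuration of the context
        val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

        // Assume each run writes to /data/output/<epochMillis>; read back the newest one
        val latest = fs.listStatus(new Path("/data/output"))
          .filter(_.isDirectory)
          .map(_.getPath.getName)
          .filter(_.forall(_.isDigit))
          .map(_.toLong)
          .max

        println(spark.read.parquet(s"/data/output/$latest").count())
        spark.stop()
      }
    }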

Re: wholeTextfiles not parallel, runs out of memory

2017-02-14 Thread Jörn Franke
Well 1) the goal of wholetextfiles is to have only one executor 2) you use .gz i.e. you will have only one executor per file maximum > On 14 Feb 2017, at 09:36, Henry Tremblay wrote: > > When I use wholeTextFiles, spark does not run in parallel, and yarn runs out > of

Re: Parquet Gzipped Files

2017-02-13 Thread Jörn Franke
Your vendor should use the parquet internal compression and not take a parquet file and gzip it. > On 13 Feb 2017, at 18:48, Benjamin Kim wrote: > > We are receiving files from an outside vendor who creates a Parquet data file > and Gzips it before delivery. Does anyone
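
For reference, a sketch of writing Parquet with its internal (per column chunk) compression, which keeps the file splittable, unlike gzipping the whole file afterwards:

    import org.apache.spark.sql.SparkSession

    object ParquetInternalCompression {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("parquet-compression").getOrCreate()
        import spark.implicits._

        Seq((1, "a"), (2, "b")).toDF("id", "value")
          .write
          .option("compression", "gzip") // or "snappy" (default) / "none"
          .mode("overwrite")
          .parquet("/tmp/compressed_parquet")

        spark.stop()
      }
    }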

Re: is dataframe thread safe?

2017-02-12 Thread Jörn Franke
Cf. also https://spark.apache.org/docs/latest/job-scheduling.html > On 12 Feb 2017, at 11:30, Jörn Franke <jornfra...@gmail.com> wrote: > > I think you should have a look at the spark documentation. It has something > called scheduler who does exactly this. In more sophisti

Re: is dataframe thread safe?

2017-02-12 Thread Jörn Franke
. > On 12 Feb 2017, at 11:45, Sean Owen <so...@cloudera.com> wrote: > > No this use case is perfectly sensible. Yes it is thread safe. > >> On Sun, Feb 12, 2017, 10:30 Jörn Franke <jornfra...@gmail.com> wrote: >> I think you should have a look at the spark docume

Re: is dataframe thread safe?

2017-02-12 Thread Jörn Franke
sistent result. So my question is, what, if any are the legal > operations to use on a dataframe so I could do the above. > > Thanks, > Assaf. > > From: Jörn Franke [mailto:jornfra...@gmail.com] > Sent: Sunday, February 12, 2017 10:39 AM > To: Men

Re: Remove dependence on HDFS

2017-02-12 Thread Jörn Franke
You have to carefully choose whether your strategy makes sense given your users' workloads. Hence, I am not sure your reasoning makes sense. However, you can, for example, install OpenStack Swift as an object store and use this as storage. HDFS in this case can be used as a temporary store

Re: is dataframe thread safe?

2017-02-12 Thread Jörn Franke
I am not sure what you are trying to achieve here. Spark is taking care of executing the transformations in a distributed fashion. This means you must not use threads - it does not make sense. Hence, you do not find documentation about it. > On 12 Feb 2017, at 09:06, Mendelson, Assaf

Re: wholeTextFiles fails, but textFile succeeds for same path

2017-02-11 Thread Jörn Franke
Can you post more information about the number of files, their size and the executor logs. A gzipped file is not splittable i.e. Only one executor can gunzip it (the unzipped data can then be processed in parallel). Wholetextfile was designed to be executed only on one executor (e.g. For
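
A small sketch of the consequence: the .gz file itself is read by a single task, so repartition right after reading to spread the downstream work; file path and partition count are examples.

    import org.apache.spark.sql.SparkSession

    object GzipNotSplittable {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("gzip-read").getOrCreate()

        val lines = spark.sparkContext
          .textFile("/data/big_file.json.gz") // one task gunzips the whole file
          .repartition(64)                    // then fan the records out to the cluster

        println(lines.count())
        spark.stop()
      }
    }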

Re: [Spark Context]: How to add on demand jobs to an existing spark context?

2017-02-08 Thread Jörn Franke
The resource management in YARN cluster mode is YARN's task. So it depends on how you configured the queues and the scheduler there. > On 8 Feb 2017, at 12:10, Cosmin Posteuca wrote: > > I tried to run some test on EMR on yarn cluster mode. > > I have a cluster with

Re: does persistence required for single action ?

2017-02-07 Thread Jörn Franke
Depends on the use case, but a persist before checkpointing can make sense after some of the map steps. > On 8 Feb 2017, at 03:09, Shushant Arora wrote: > > Hi > > I have a workflow like below: > > rdd1 = sc.textFile(input); > rdd2 = rdd1.filter(filterfunc1); >
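
A sketch of persisting before checkpointing so the map steps are not recomputed when the checkpoint job runs (input path is hypothetical):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    object PersistBeforeCheckpoint {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("persist-checkpoint").getOrCreate()
        val sc = spark.sparkContext
        sc.setCheckpointDir("/tmp/checkpoints")

        val rdd2 = sc.textFile("/data/input")
          .filter(_.nonEmpty)
          .map(_.toUpperCase)
          .persist(StorageLevel.MEMORY_AND_DISK) // cached result feeds the checkpoint ...

        rdd2.checkpoint()                        // ... instead of recomputing the lineage
        println(rdd2.count())                    // first action triggers the checkpoint

        spark.stop()
      }
    }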

Re: Launching an Spark application in a subset of machines

2017-02-07 Thread Jörn Franke
If you want to run them always on the same machines use yarn node labels. If it is any 10 machines then use capacity or fair scheduler. What is the use case for running it always on the same 10 machines? If it is for licensing reasons then I would ask your vendor if this is a suitable means to

Re: spark architecture question -- Pleas Read

2017-02-05 Thread Jörn Franke
hnical content is explicitly disclaimed. The > author will in no case be liable for any monetary damages arising from such > loss, damage or destruction. > > >> On 29 January 2017 at 22:22, Jörn Franke <jornfra...@gmail.com> wrote: >> You can use HDFS, S3, Azure,

Re: Spark 2 + Java + UDF + unknown return type...

2017-02-02 Thread Jörn Franke
Not sure what your udf is exactly doing, but why not one udf per type? You avoid issues converting it, and it is more obvious for the user of your udf etc. You could of course return a complex type with one long, one string and one double and you fill them in the udf as needed, but this would be

Re: Is it okay to run Hive Java UDFS in Spark-sql. Anybody's still doing it?

2017-02-02 Thread Jörn Franke
There are many performance aspects here which may not only related to the UDF itself, but on configuration of platform, data etc. You seem to have a performance problem with your UDFs. Maybe you can elaborate on 1) what data you process (format, etc) 2) what you try to Analyse 3) how you

Re: Tableau BI on Spark SQL

2017-01-30 Thread Jörn Franke
able for any monetary damages arising from such > loss, damage or destruction. > > >> On 30 January 2017 at 21:51, Jörn Franke <jornfra...@gmail.com> wrote: >> Depending on the size of the data i recommend to schedule regularly an >> extract in tableau. There

Re: Tableau BI on Spark SQL

2017-01-30 Thread Jörn Franke
Depending on the size of the data i recommend to schedule regularly an extract in tableau. There tableau converts it to an internal in-memory representation outside of Spark (can also exist on disk if memory is too small) and then use it within Tableau. Accessing directly the database is not

Re: spark architecture question -- Pleas Read

2017-01-29 Thread Jörn Franke
I meant a distributed file system such as Ceph, Gluster etc... > On 29 Jan 2017, at 14:45, Jörn Franke <jornfra...@gmail.com> wrote: > > One alternative could be the oracle Hadoop loader and other Oracle products, > but you have to invest some money and probably buy thei

Re: spark architecture question -- Pleas Read

2017-01-29 Thread Jörn Franke
One alternative could be the Oracle Hadoop loader and other Oracle products, but you have to invest some money and probably buy their Hadoop Appliance, and you have to evaluate whether that makes sense (it can get expensive with large clusters etc). Another alternative would be to get rid of Oracle

Re: spark architecture question -- Pleas Read

2017-01-28 Thread Jörn Franke
Hard to tell. Can you give more insights on what you try to achieve and what the data is about? For example, depending on your use case sqoop can make sense or not. > On 28 Jan 2017, at 02:14, Sirisha Cheruvu wrote: > > Hi Team, > > RIght now our existing flow is > >

Re: Text

2017-01-27 Thread Jörn Franke
Sorry the message was not complete: the key is the file position, so if you sort by key the lines will be in the same order as in the original file > On 27 Jan 2017, at 14:45, Jörn Franke <jornfra...@gmail.com> wrote: > > I agree with the previous statements. You cannot expe

Re: Text

2017-01-27 Thread Jörn Franke
I agree with the previous statements. You cannot expect any ordering guarantee. This means you need to ensure that the same ordering is done as the original file. Internally Spark is using the Hadoop Client libraries - even if you do not have Hadoop installed, because it is a flexible

Re: Saving from Dataset to Bigquery Table

2017-01-20 Thread Jörn Franke
know if we can run this from > within our local machine? given that all the required jars are downloaded by > SBT anyways. > >> On 20 January 2017 at 11:22, Jörn Franke <jornfra...@gmail.com> wrote: >> It is only on pairdd >> >>> On 20 Jan 2017, at 11:54,

Re: Saving from Dataset to Bigquery Table

2017-01-20 Thread Jörn Franke
It is only on PairRDD > On 20 Jan 2017, at 11:54, A Shaikh wrote: > > Has anyone experience saving Dataset to Bigquery Table? > > I am loading into BigQuery using the following example successfully. This uses > RDD.saveAsNewAPIHadoopDataset method to save data. > I am

Re: How to do dashboard reporting in spark

2017-01-19 Thread Jörn Franke
You can use zeppelin if you want to directly interact with Spark. For traditional tools you have the right ideas (any of them works depending on requirements) See also lambda architecture > On 20 Jan 2017, at 08:18, Gaurav1809 wrote: > > Hi All, > > > Once data is

Re: Quick but probably silly question...

2017-01-17 Thread Jörn Franke
You run compaction, i.e. save the modified/deleted records in a dedicated file. Every now and then you compare the original and delta file and create a new version. When querying before compaction you need to check both the original and the delta file. I don't think ORC needs Tez for it, but it

Re: Spark/Parquet/Statistics question

2017-01-17 Thread Jörn Franke
Hello, I am not sure what you mean by min/max for strings. I do not know if this makes sense. What the ORC format has is bloom filters for strings etc. - are you referring to this? In order to apply min/max filters Spark needs to read the metadata of the file. If the filter is applied or

Re: AVRO Append HDFS using saveAsNewAPIHadoopFile

2017-01-09 Thread Jörn Franke
Avro itself supports it, but I am not sure if this functionality is available through the Spark API. Just out of curiosity, if your use case is only write to HDFS then you might use simply flume. > On 9 Jan 2017, at 09:58, awkysam wrote: > > Currently for our

Re: How to connect Tableau to databricks spark?

2017-01-08 Thread Jörn Franke
Firewall Ports open? Hint: for security reasons you should not connect via the internet. > On 9 Jan 2017, at 04:30, Raymond Xie wrote: > > I want to do some data analytics work by leveraging Databricks spark platform > and connect my Tableau desktop to it for data

Re: Has anyone managed to connect to Oracle via JDBC from Spark CDH 5.5.2

2016-12-22 Thread Jörn Franke
Why not upgrade to ojdbc7 - this one is for java 7+8? Keep in mind that the jdbc driver is updated constantly (although simply called ojdbc7). I would be surprised if this does not work with cloudera as it runs on the oracle big data appliance. > On 22 Dec 2016, at 21:44, Mich Talebzadeh

Re: Reading xls and xlsx files

2016-12-19 Thread Jörn Franke
I am currently developing one https://github.com/ZuInnoTe/hadoopoffice It contains working source code, but a release will likely be only beginning of the year (will include a Spark data source, but the existing source code can be used without issues in a Spark application). > On 19 Dec 2016,

Re: Optimization for Processing a million of HTML files

2016-12-12 Thread Jörn Franke
In Hadoop you should not have many small files. Put them into a HAR. > On 13 Dec 2016, at 05:42, Jakob Odersky wrote: > > Assuming the bottleneck is IO, you could try saving your files to > HDFS. This will distribute your data and allow for better concurrent > reads. > >> On

Re: .tar.bz2 in spark

2016-12-08 Thread Jörn Franke
Tar is not out of the box supported. Just store the file as .json.bz2 without using tar. > On 8 Dec 2016, at 20:18, Maurin Lenglart wrote: > > Hi, > I am trying to load a json file compress in .tar.bz2 but spark throw an error. > I am using pyspark with spark 1.6.2.
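
For illustration, reading a bzip2-compressed JSON file directly (no tar wrapper); bzip2 is splittable, so it can even be processed in parallel. The path is made up.

    import org.apache.spark.sql.SparkSession

    object ReadBz2Json {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("read-bz2").getOrCreate()

        // .json.bz2 is decompressed transparently by the Hadoop input format
        val df = spark.read.json("/data/events.json.bz2")
        df.printSchema()
        println(df.count())

        spark.stop()
      }
    }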

Re: Can spark support exactly once based kafka ? Due to these following question?

2016-12-05 Thread Jörn Franke
You need to do the book keeping of what has been processed yourself. This may mean roughly the following (of course the devil is in the details): Write down in zookeeper which part of the processing job has been done and for which dataset all the data has been created (do not keep the data

Re: Access multiple cluster

2016-12-04 Thread Jörn Franke
If you do it frequently then you may simply copy the data to the processing cluster. Alternatively, you could create an external table in the processing cluster to the analytics cluster. However, this has to be supported by appropriate security configuration and might be less efficient than

Re: Apache Spark SQL is taking forever to count billion rows from Cassandra?

2016-11-24 Thread Jörn Franke
I am not sure what use case you want to demonstrate with select count in general. Maybe you can elaborate more what your use case is. Aside from this: this is a Cassandra issue. What is the setup of Cassandra? Dedicated nodes? How many? Replication strategy? Consistency configuration? How is

Re: RDD Partitions on HDFS file in Hive on Spark Query

2016-11-21 Thread Jörn Franke
Use ORC, Parquet or Avro as a format because they support any compression type with parallel processing. Alternatively split your file into several smaller ones. Another alternative would be bzip2 (but slower in general) or LZO (usually it is not included by default in many distributions). > On

Re: How to write a custom file system?

2016-11-21 Thread Jörn Franke
Once you configured a custom file system in Hadoop it can be used by Spark out of the box. Depending what you implement in the custom file system you may think about side effects to any application including spark (memory consumption etc). > On 21 Nov 2016, at 18:26, Samy Dindane

Re: Handling windows characters with Spark CSV on Linux

2016-11-17 Thread Jörn Franke
You can do the conversion of character set (is this the issue?) as part of your loading process in Spark. As far as I know the Spark CSV package is based on the Hadoop TextInputFormat. This format, to the best of my knowledge, supports only UTF-8. So you have to do a conversion from windows to UTF-8.
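
One way to do that conversion during loading (a sketch, assuming the source is windows-1252): read the raw bytes via the Hadoop TextInputFormat and decode them explicitly instead of letting textFile assume UTF-8.

    import java.nio.charset.Charset
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapred.TextInputFormat
    import org.apache.spark.sql.SparkSession

    object Cp1252ToUtf8 {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("charset-conversion").getOrCreate()

        val lines = spark.sparkContext
          .hadoopFile("/data/windows_export.csv",
            classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
          .map { case (_, text) =>
            // Text holds the raw file bytes; decode them as windows-1252
            new String(text.getBytes, 0, text.getLength, Charset.forName("windows-1252"))
          }

        lines.take(5).foreach(println)
        spark.stop()
      }
    }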

Re: AVRO File size when caching in-memory

2016-11-14 Thread Jörn Franke
spark version? Are you using tungsten? > On 14 Nov 2016, at 10:05, Prithish wrote: > > Can someone please explain why this happens? > > When I read a 600kb AVRO file and cache this in memory (using cacheTable), it > shows up as 11mb (storage tab in Spark UI). I have tried

Re: Possible DR solution

2016-11-12 Thread Jörn Franke
What is wrong with the good old batch transfer for transferring data from a cluster to another? I assume your use case is only business continuity in case of disasters such as data center loss, which are unlikely to happen (well it does not mean they do not happen) and where you could afford to

Re: Joining to a large, pre-sorted file

2016-11-10 Thread Jörn Franke
Can you split the file beforehand into several files (e.g. by the column you do the join on)? > On 10 Nov 2016, at 23:45, Stuart White wrote: > > I have a large "master" file (~700m records) that I frequently join smaller > "transaction" files to. (The transaction
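
A sketch of splitting (partitioning) the large master file by the join key up front, so later joins only have to touch the matching directories; column names and the bucket count are examples.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    object PrePartitionMaster {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("pre-partition").getOrCreate()

        // One-off preparation of the large master file
        spark.read.parquet("/data/master")
          .withColumn("key_bucket", col("join_key") % 100) // coarse split on the join key
          .write
          .partitionBy("key_bucket")
          .mode("overwrite")
          .parquet("/data/master_partitioned")

        spark.stop()
      }
    }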

Re: importing data into hdfs/spark using Informatica ETL tool

2016-11-09 Thread Jörn Franke
Basically you mention the options. However, there are several ways how informatica can extract (or store) from/to rdbms. If the native option is not available then you need to go via JDBC as you have described. Alternatively (but only if it is worth it) you can schedule fetching of the files

Re: Generate random numbers from Normal Distribution with Specific Mean and Variance

2016-10-24 Thread Jörn Franke
his email's technical content is explicitly disclaimed. The > author will in no case be liable for any monetary damages arising from such > loss, damage or destruction. > > >> On 24 October 2016 at 17:09, Jörn Franke <jornfra...@gmail.com> wrote: >> Bigtop contain

Re: Generate random numbers from Normal Distribution with Specific Mean and Variance

2016-10-24 Thread Jörn Franke
Bigtop contains a random data generator mainly for transactions, but it could be rather easily adapted > On 24 Oct 2016, at 18:04, Mich Talebzadeh wrote: > > me being lazy > > Does anyone have a library to create an array of random numbers from normal >
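
If the generation should happen inside Spark itself, one lazy option is to scale and shift the built-in standard-normal generator; the mean and standard deviation below are example values.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.randn

    object NormalRandomNumbers {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("normal-random").getOrCreate()

        val mean = 5.0
        val stdDev = 2.0 // i.e. variance = 4.0

        // randn draws from N(0,1); x = randn * stdDev + mean follows N(mean, stdDev^2)
        val df = spark.range(1000000).withColumn("x", randn(42) * stdDev + mean)

        df.describe("x").show() // sample mean ~5.0, sample stddev ~2.0
        spark.stop()
      }
    }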

Re: issue accessing Phoenix table from Spark

2016-10-21 Thread Jörn Franke
Have you verified that this class is in the fat jar? It looks that it misses some of the Hbase libraries ... > On 21 Oct 2016, at 11:45, Mich Talebzadeh wrote: > > Still does not work with Spark 2.0.0 on apache-phoenix-4.8.1-HBase-1.2-bin > > thanks > > Dr Mich

Re: Ensuring an Avro File is NOT Splitable

2016-10-20 Thread Jörn Franke
What is the use case of this? You will reduce performance significantly. Nevertheless, the way you propose is the way to go, but I do not recommend it. > On 20 Oct 2016, at 14:00, Ashan Taha wrote: > > Hi > > What’s the best way to make sure an Avro file is NOT Splitable

Re: Accessing Hbase tables through Spark, this seems to work

2016-10-18 Thread Jörn Franke
Careful: HBase with Phoenix is faster only in certain scenarios, namely when it is about processing small amounts out of a bigger amount of data (depends on node memory, the operation etc). Hive+Tez+ORC can be rather competitive; LLAP makes sense for interactive ad-hoc queries that are rather

Re: Spark SQL Thriftserver with HBase

2016-10-17 Thread Jörn Franke
stributed NoSQL engine. > Remember Big Data isn’t relational its more of a hierarchy model or record > model. Think IMS or Pick (Dick Pick’s revelation, U2, Universe, etc …) > > >> On Oct 17, 2016, at 3:45 PM, Jörn Franke <jornfra...@gmail.com> wrote: >> >>

Re: Spark SQL Thriftserver with HBase

2016-10-17 Thread Jörn Franke
> required. This is also goes for the REST Endpoints. 3rd party services will >>> hit ours to update our data with no need to read from our data. And, when >>> we want to update their data, we will hit theirs to update their data using >>> a triggered job. >>>

Re: Spark ML OOM problem

2016-10-12 Thread Jörn Franke
Which Spark version? Are you using RDDs? Or datasets? What type are the features? If string how large? Is it spark standalone? How do you train/configure the algorithm. How do you initially parse the data? The standard driver and executor logs could be helpful. > On 12 Oct 2016, at 09:24, 陈哲

Re: Spark Streaming Advice

2016-10-10 Thread Jörn Franke
Your file size is too small; this has a significant impact on the namenode. Use HBase or maybe HAWQ to store small writes. > On 10 Oct 2016, at 16:25, Kevin Mellott wrote: > > Whilst working on this application, I found a setting that drastically > improved the

Re: How to use Spark-Scala to download a CSV file from the web?

2016-09-25 Thread Jörn Franke
Use a tool like flume and/or oozie to reliable download files from http and do error handling (e.g. Requests time out). Afterwards process the data with spark. > On 25 Sep 2016, at 10:27, Dan Bikle wrote: > > hello spark-world, > > How to use Spark-Scala to download a CSV

Re: ideas on de duplication for spark streaming?

2016-09-24 Thread Jörn Franke
As Cody said, Spark is not going to help you here. There are two issues you need to look at here: duplicated (or even more) messages processed by two different processes and the case of failure of any component (including the message broker). Keep in mind that duplicated messages can even

Re: Redshift Vs Spark SQL (Thrift)

2016-09-23 Thread Jörn Franke
Depends what your use case is. A generic benchmark does not make sense, because they are different technologies for different purposes. > On 23 Sep 2016, at 06:09, ayan guha wrote: > > Hi > > Is there any benchmark or point of view in terms of pros and cons between AWS >

Re: Off Heap (Tungsten) Memory Usage / Management ?

2016-09-22 Thread Jörn Franke
will kill the process because it's using more >> >> memory than it asked for. A JVM is always going to use a little >> >> off-heap memory by itself, so setting a max heap size of 2GB means the >> >> JVM process may use a bit more than 2GB of memory. With an off-heap >>

Re: Off Heap (Tungsten) Memory Usage / Management ?

2016-09-22 Thread Jörn Franke
tensive app like Spark it can be a lot more. > > There's a built-in 10% overhead, so that if you ask for a 3GB executor > it will ask for 3.3GB from YARN. You can increase the overhead. > > On Wed, Sep 21, 2016 at 11:41 PM, Jörn Franke <jornfra...@gmail.com> > wrote: > &

Re: Memory usage by Spark jobs

2016-09-22 Thread Jörn Franke
You should take also into account that spark has different option to represent data in-memory, such as Java serialized objects, Kyro serialized, Tungsten (columnar optionally compressed) etc. the tungsten thing depends heavily on the underlying data and sorting especially if compressed. Then,

Re: Off Heap (Tungsten) Memory Usage / Management ?

2016-09-21 Thread Jörn Franke
All off-heap memory is still managed by the JVM process. If you limit the memory of this process then you limit the memory. I think the memory of the JVM process could be limited via the xms/xmx parameter of the JVM. This can be configured via spark options for yarn (be aware that they are
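
For illustration, the knobs mentioned above as they would be set for a YARN deployment in that Spark era (values are examples; normally these are passed via spark-submit --conf rather than hard-coded):

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession

    object MemorySettings {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .set("spark.executor.memory", "3g")               // executor JVM heap (-Xmx)
          .set("spark.yarn.executor.memoryOverhead", "768") // MB YARN adds for JVM/off-heap overhead
          .set("spark.memory.offHeap.enabled", "true")      // allow Tungsten off-heap allocation
          .set("spark.memory.offHeap.size", "536870912")    // 512 MB; account for it in the overhead on YARN

        val spark = SparkSession.builder().appName("memory-settings").config(conf).getOrCreate()
        println(spark.sparkContext.getConf.get("spark.executor.memory"))
        spark.stop()
      }
    }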