Re: Log4j 1.2.17 spark CVE

2021-12-14 Thread Steve Loughran
log4j 1.2.17 is not vulnerable. There is an existing CVE there from a log aggregation servlet; Cloudera products ship a patched release with that servlet stripped... ASF projects are not allowed to do that. But: some recent Cloudera products do include log4j 2.x, so colleagues of mine are busy

Re: Missing module spark-hadoop-cloud in Maven central

2021-06-02 Thread Steve Loughran
off the record: Really irritates me too, as it forces me to do local builds even though I shouldn't have to. Sometimes I do that for other reasons, but still. Getting the cloud-storage module in was hard enough at the time that I wasn't going to push harder; I essentially stopped trying to get

Re: java.lang.ClassNotFoundException for s3a committer

2020-07-21 Thread Steve Loughran
…"org.apache.spark.internal.io.cloud.PathOutputCommitProtocol"); hadoopConfiguration.set("spark.sql.parquet.output.committer.class", "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter"); hadoopConfiguration.set("fs.s3a.connection.maximum", Integer.toString(coreCount…
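The settings quoted in this snippet can be collected into a plain dict for illustration, with no pyspark required. The two committer class names and `fs.s3a.connection.maximum` appear in the thread; the `spark.sql.sources.commitProtocolClass` key is the usual companion setting and is stated here as an assumption, as are the core count and pool-size arithmetic.

```python
core_count = 16  # hypothetical executor core count

committer_conf = {
    # route Spark's commit protocol to the cloud-committer binding
    # (key name assumed; the class names are the ones quoted above)
    "spark.sql.sources.commitProtocolClass":
        "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol",
    "spark.sql.parquet.output.committer.class":
        "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter",
    # widen the s3a connection pool alongside the core count (illustrative)
    "fs.s3a.connection.maximum": str(core_count * 2),
}

for key, value in sorted(committer_conf.items()):
    print(f"{key}={value}")
```

Each entry would be passed to `hadoopConfiguration.set(...)` as in the quoted Java code.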

Re: java.lang.ClassNotFoundException for s3a committer

2020-06-29 Thread Steve Loughran
you are going to need hadoop-3.1 on your classpath, with hadoop-aws and the same aws-sdk it was built with (1.11.something). Mixing hadoop JARs is doomed. Using a different aws sdk jar is a bit risky, though more recent upgrades have all been fairly low stress On Fri, 19 Jun 2020 at 05:39, murat
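The "mixing hadoop JARs is doomed" rule above can be turned into a mechanical check: every `hadoop-*` JAR on the classpath should carry one and the same version. A minimal stdlib sketch, with hypothetical JAR file names:

```python
import re

def hadoop_versions(jars):
    """Return the set of version strings found on hadoop-* jars."""
    pattern = re.compile(r"^hadoop-[a-z-]+-(\d+\.\d+\.\d+)\.jar$")
    return {m.group(1) for j in jars if (m := pattern.match(j))}

# illustrative classpath listing; non-hadoop jars are ignored
jars = ["hadoop-common-3.1.0.jar", "hadoop-aws-3.1.0.jar",
        "spark-core_2.11-2.4.0.jar"]
versions = hadoop_versions(jars)
print("consistent" if len(versions) == 1 else f"mixed: {versions}")  # consistent
```

A mixed set such as `hadoop-aws-2.7.3.jar` next to `hadoop-common-3.1.0.jar` would fail the check, which is exactly the doomed configuration.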

Re: [Spark Streaming] Spark Streaming with S3 vs Kinesis

2018-06-26 Thread Steve Loughran
On 25 Jun 2018, at 23:59, Farshid Zavareh wrote: I'm writing a Spark Streaming application where the input data is put into an S3 bucket in small batches (using Database Migration Service - DMS). The Spark application is the only consumer. I'm considering two

Re: Palantir release under org.apache.spark?

2018-01-11 Thread Steve Loughran
On 9 Jan 2018, at 18:10, Sean Owen wrote: Just to follow up -- those are actually in a Palantir repo, not Central. Deploying to Central would be uncourteous, but this approach is legitimate and how it has to work for vendors to release distros

Re: Writing files to s3 with out temporary directory

2017-12-01 Thread Steve Loughran
Hadoop trunk (i.e 3.1 when it comes out), has the code to do 0-rename commits http://steveloughran.blogspot.co.uk/2017/11/subatomic.html if you want to play today, you can build Hadoop trunk & spark master, + a little glue JAR of mine to get Parquet to play properly

Re: Process large JSON file without causing OOM

2017-11-15 Thread Steve Loughran
On 14 Nov 2017, at 15:32, Alec Swan > wrote: But I wonder if there is a way to stream/batch the content of JSON file in order to convert it to ORC piecemeal and avoid reading the whole JSON file in memory in the first place? That is what

Re: Anyone knows how to build and spark on jdk9?

2017-10-30 Thread Steve Loughran
On 27 Oct 2017, at 19:24, Sean Owen > wrote: Certainly, Scala 2.12 support precedes Java 9 support. A lot of the work is in place already, and the last issue is dealing with how Scala closures are now implemented quite different with lambdas /

Re: Why does Spark need to set log levels

2017-10-12 Thread Steve Loughran
> On 9 Oct 2017, at 16:49, Daan Debie wrote: > > Hi all! > > I would love to use Spark with a somewhat more modern logging framework than > Log4j 1.2. I have Logback in mind, mostly because it integrates well with > central logging solutions such as the ELK stack. I've

Re: Quick one... AWS SDK version?

2017-10-04 Thread Steve Loughran
via the s3a client, you will need the SDK version to match the hadoop-aws JAR of the same version of Hadoop your JARs have. Similarly, if you were using spark-kinesis, it needs to be in sync there. From: Steve Loughran [mailto:ste...@hortonworks.com] Sent: Tuesday, October 03, 2017 2:20 PM To: JG

Re: Quick one... AWS SDK version?

2017-10-03 Thread Steve Loughran
On 3 Oct 2017, at 02:28, JG Perrin wrote: Hey Sparkians, What version of AWS Java SDK do you use with Spark 2.2? Do you stick with the Hadoop 2.7.3 libs? You generally have to stick with the version which hadoop was built with I'm

Re: how do you deal with datetime in Spark?

2017-10-03 Thread Steve Loughran
On 3 Oct 2017, at 18:43, Adaryl Wakefield > wrote: I gave myself a project to start actually writing Spark programs. I’m using Scala and Spark 2.2.0. In my project, I had to do some grouping and filtering by dates. It was awful

Re: More instances = slower Spark job

2017-10-01 Thread Steve Loughran
On 28 Sep 2017, at 15:27, Daniel Siegmann wrote: Can you kindly explain how Spark uses parallelism for bigger (say 1GB) text file? Does it use InputFormat to create multiple splits and creates 1 partition per split?

Re: More instances = slower Spark job

2017-10-01 Thread Steve Loughran
> On 28 Sep 2017, at 14:45, ayan guha wrote: > > Hi > > Can you kindly explain how Spark uses parallelism for bigger (say 1GB) text > file? Does it use InputFormat to create multiple splits and creates 1 > partition per split? Yes, Input formats give you their splits,

Re: HDFS or NFS as a cache?

2017-09-30 Thread Steve Loughran
On 29 Sep 2017, at 20:03, JG Perrin > wrote: You will collect in the driver (often the master) and it will save the data, so for saving, you will not have to set up HDFS. no, it doesn't work quite like that. 1. workers generate their data and

Re: HDFS or NFS as a cache?

2017-09-30 Thread Steve Loughran
On 29 Sep 2017, at 15:59, Alexander Czech > wrote: Yes I have identified the rename as the problem, that is why I think the extra bandwidth of the larger instances might not help. Also there is a consistency issue with S3

Re: More instances = slower Spark job

2017-09-28 Thread Steve Loughran
On 28 Sep 2017, at 09:41, Jeroen Miller > wrote: Hello, I am experiencing a disappointing performance issue with my Spark jobs as I scale up the number of instances. The task is trivial: I am loading large (compressed) text files from S3,

Re: CSV write to S3 failing silently with partial completion

2017-09-08 Thread Steve Loughran
On 7 Sep 2017, at 18:36, Mcclintic, Abbi > wrote: Thanks all – couple notes below. Generally all our partitions are of equal size (ie on a normal day in this particular case I see 10 equally sized partitions of 2.8 GB). We see the problem with

Re: How to authenticate to ADLS from within spark job on the fly

2017-08-19 Thread Steve Loughran
On 19 Aug 2017, at 02:42, Imtiaz Ahmed > wrote: Hi All, I am building a spark library which developers will use when writing their spark jobs to get access to data on Azure Data Lake. But the authentication will depend on the dataset they

Re: spark.write.csv is not able write files to specified path, but is writing to unintended subfolder _temporary/0/task_xxx folder on worker nodes

2017-08-11 Thread Steve Loughran
On 10 Aug 2017, at 09:51, Hemanth Gudela wrote: Yeah, installing HDFS in our environment is unfortunately going to take lot of time (approvals/planning etc). I will have to live with local FS for now. The other option I had

Re: SPARK Issue in Standalone cluster

2017-08-04 Thread Steve Loughran
> On 3 Aug 2017, at 19:59, Marco Mistroni wrote: > > Hello > my 2 cents here, hope it helps > If you want to just to play around with Spark, i'd leave Hadoop out, it's an > unnecessary dependency that you dont need for just running a python script > Instead do the

Re: SPARK Issue in Standalone cluster

2017-08-03 Thread Steve Loughran
On 2 Aug 2017, at 20:05, Gourav Sengupta > wrote: Hi Steve, I have written a sincere note of apology to everyone in a separate email. I sincerely request your kind forgiveness before hand if anything does sound impolite in my

Re: PySpark Streaming S3 checkpointing

2017-08-02 Thread Steve Loughran
On 2 Aug 2017, at 10:34, Riccardo Ferrari > wrote: Hi list! I am working on a pyspark streaming job (ver 2.2.0) and I need to enable checkpointing. At high level my python script goes like this: class StreamingJob(): def __init__(..): ...

Re: SPARK Issue in Standalone cluster

2017-08-02 Thread Steve Loughran
On 2 Aug 2017, at 14:25, Gourav Sengupta > wrote: Hi, I am definitely sure that at this point of time everyone who has kindly cared to respond to my query do need to go and check this link

Re: Spark, S3A, and 503 SlowDown / rate limit issues

2017-07-12 Thread Steve Loughran
On 10 Jul 2017, at 21:57, Everett Anderson wrote: Hey, Thanks for the responses, guys! On Thu, Jul 6, 2017 at 7:08 AM, Steve Loughran wrote: On 5 Jul 2017,

Re: Using Spark as a simulator

2017-07-07 Thread Steve Loughran
On 7 Jul 2017, at 08:37, Esa Heikkinen wrote: I only want to simulate a very huge "network" with even millions of parallel time-synchronized actors (state machines). There is also communication between actors via some (key-value

Re: Spark, S3A, and 503 SlowDown / rate limit issues

2017-07-06 Thread Steve Loughran
On 5 Jul 2017, at 14:40, Vadim Semenov > wrote: Are you sure that you use S3A? Because EMR says that they do not support S3A https://aws.amazon.com/premiumsupport/knowledge-center/emr-file-system-s3/ > Amazon EMR does not

Re: Spark querying parquet data partitioned in S3

2017-07-05 Thread Steve Loughran
> On 29 Jun 2017, at 17:44, fran wrote: > > We have got data stored in S3 partitioned by several columns. Let's say > following this hierarchy: > s3://bucket/data/column1=X/column2=Y/parquet-files > > We run a Spark job in a EMR cluster (1 master,3 slaves) and

Re: Question on Spark code

2017-06-26 Thread Steve Loughran
On 25 Jun 2017, at 20:57, kant kodali wrote: impressive! I need to learn more about scala. What I mean by stripping away the conditional check in Java is this: static final boolean isLogInfoEnabled = false; public void logMessage(String message) {
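The Java snippet above relies on a compile-time constant so the logging branch is eliminated. A runtime analogue of the same guard, sketched in Python with the standard `logging` module: test the level before building an expensive message, so the cost disappears when the level is disabled.

```python
import logging

logger = logging.getLogger("guard-demo")
logger.setLevel(logging.WARNING)

calls = {"expensive": 0}

def expensive_message():
    # stands in for any costly string construction
    calls["expensive"] += 1
    return "costly " + "x" * 1000

if logger.isEnabledFor(logging.INFO):  # False here: INFO < WARNING
    logger.info(expensive_message())

print(calls["expensive"])  # → 0, the message was never built
```

Unlike the Java `static final` trick, this check happens at runtime, but it still avoids paying for message construction on disabled levels.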

Re: Question about standalone Spark cluster reading from Kerberosed hadoop

2017-06-23 Thread Steve Loughran
On 23 Jun 2017, at 10:22, Saisai Shao > wrote: Spark running with standalone cluster manager currently doesn't support accessing security Hadoop. Basically the problem is that standalone mode Spark doesn't have the facility to distribute

Re: Using YARN w/o HDFS

2017-06-23 Thread Steve Loughran
you'll need a filesystem with * consistency * accessibility everywhere * supports a binding through one of the hadoop fs connectors NFS-style distributed filesystems work with file:// ; things like glusterfs need their own connectors. you can use azure's wasb:// as a drop in replacement for

Re: SparkSQL not able to read a empty table location

2017-05-20 Thread Steve Loughran
On 20 May 2017, at 01:44, Bajpai, Amit X. -ND > wrote: Hi, I have a hive external table with the S3 location having no files (but the S3 location directory does exists). When I am trying to use Spark SQL to count the number of records in

Re: Spark <--> S3 flakiness

2017-05-18 Thread Steve Loughran
…testing in the next day or so. Thanks! Gary On 17 May 2017 at 03:19, Steve Loughran wrote: On 17 May 2017, at 06:00, lucas.g...@gmail.com wrote: Steve, thanks for the reply. Digging through

Re: s3 bucket access/read file

2017-05-17 Thread Steve Loughran
On 17 May 2017, at 00:10, jazzed > wrote: How did you solve the problem with V4? which v4 problem? Authentication? you need to declare the explicit s3a endpoint via fs.s3a.endpoint , otherwise you get a generic "bad auth" message which
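The `fs.s3a.endpoint` advice above can be sketched as a config fragment. The region chosen here (Frankfurt, a V4-signature-only region) and the `spark.hadoop.` passthrough prefix are illustrative assumptions, not from the thread:

```python
# explicit regional endpoint so V4-signing regions authenticate
# (region is a hypothetical example)
v4_conf = {"fs.s3a.endpoint": "s3.eu-central-1.amazonaws.com"}

# render as spark-submit options; spark.hadoop.* is forwarded into
# the Hadoop configuration
for key, value in v4_conf.items():
    print(f"--conf spark.hadoop.{key}={value}")
```

Without the explicit endpoint, a V4-only region yields the generic "bad auth" failure described above.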

Re: Parquet file amazon s3a timeout

2017-05-17 Thread Steve Loughran
On 17 May 2017, at 11:13, Karin Valisova > wrote: Hello! I'm working with some parquet files saved on amazon service and loading them to dataframe with Dataset df = spark.read() .parquet(parketFileLocation); however, after some time I get the

Re: Spark <--> S3 flakiness

2017-05-17 Thread Steve Loughran
https://www.cloudera.com/documentation/enterprise/5-9-x/topics/spark_s3.html On 16 May 2017 at 10:10, Steve Loughran wrote: On 11 May 2017, at 06:07, lucas.g...@gmail.com wrote: Hi users, we have

Re: Spark <--> S3 flakiness

2017-05-16 Thread Steve Loughran
On 11 May 2017, at 06:07, lucas.g...@gmail.com wrote: Hi users, we have a bunch of pyspark jobs that are using S3 for loading / intermediate steps and final output of parquet files. Please don't, not without a committer specially written to work against S3 in the

Re: [WARN] org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

2017-05-16 Thread Steve Loughran
On 10 May 2017, at 13:40, Mendelson, Assaf > wrote: Hi all, When running spark I get the following warning: [WARN] org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java

Re: parquet optimal file structure - flat vs nested

2017-05-03 Thread Steve Loughran
> On 30 Apr 2017, at 09:19, Zeming Yu wrote: > > Hi, > > We're building a parquet based data lake. I was under the impression that > flat files are more efficient than deeply nested files (say 3 or 4 levels > down). Is that correct? > > Thanks, > Zeming Where's the data

Re: removing columns from file

2017-05-01 Thread Steve Loughran
On 28 Apr 2017, at 16:10, Anubhav Agarwal > wrote: Are you using Spark's textFiles method? If so, go through this blog :- http://tech.kinja.com/how-not-to-pull-from-s3-using-apache-spark-1704509219 old/dated blog post. If you get the Hadoop 2.8

Re: Questions related to writing data to S3

2017-04-24 Thread Steve Loughran
On 23 Apr 2017, at 19:49, Richard Hanson > wrote: I have a streaming job which writes data to S3. I know there are saveAs functions helping write data to S3. But it bundles all elements then writes out to S3. use Hadoop 2.8.x binaries and

Re: splitting a huge file

2017-04-24 Thread Steve Loughran
> On 21 Apr 2017, at 19:36, Paul Tremblay wrote: > > We are tasked with loading a big file (possibly 2TB) into a data warehouse. > In order to do this efficiently, we need to split the file into smaller files. > > I don't believe there is a way to do this with Spark,
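Pre-splitting a large file outside Spark, as discussed above, needs nothing beyond the standard library. A minimal sketch that streams a text file into fixed line-count parts; paths and the chunk size are illustrative:

```python
import itertools, os, tempfile

def split_by_lines(src, dst_dir, lines_per_part=2):
    """Stream src into part files of at most lines_per_part lines each."""
    parts = []
    with open(src) as f:
        for i in itertools.count():
            chunk = list(itertools.islice(f, lines_per_part))
            if not chunk:
                break
            part = os.path.join(dst_dir, f"part-{i:05d}.txt")
            with open(part, "w") as out:
                out.writelines(chunk)
            parts.append(part)
    return parts

# demo on a throwaway 5-line file
with tempfile.TemporaryDirectory() as d:
    src = os.path.join(d, "big.txt")
    with open(src, "w") as f:
        f.write("".join(f"line{i}\n" for i in range(5)))
    print(len(split_by_lines(src, d)))  # → 3 (parts of 2, 2 and 1 lines)
```

Because the input is streamed with `islice`, memory use stays constant regardless of file size, which is what matters for a 2TB input.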

Re: What is the best way to run a scheduled spark batch job on AWS EC2 ?

2017-04-12 Thread Steve Loughran
applicable to building data products and data warehousing. I concur Regards, Gourav -Steve On Wed, Apr 12, 2017 at 12:42 PM, Steve Loughran wrote: On 11 Apr 2017, at 20:46, Gourav Sengupta…

Re: What is the best way to run a scheduled spark batch job on AWS EC2 ?

2017-04-12 Thread Steve Loughran
On 11 Apr 2017, at 20:46, Gourav Sengupta > wrote: And once again JAVA programmers are trying to solve a data analytics and data warehousing problem using programming paradigms. It genuinely a pain to see this happen. While I'm

Re: optimising storage and ec2 instances

2017-04-11 Thread Steve Loughran
> On 11 Apr 2017, at 11:07, Zeming Yu wrote: > > Hi all, > > I'm a beginner with spark, and I'm wondering if someone could provide > guidance on the following 2 questions I have. > > Background: I have a data set growing by 6 TB p.a. I plan to use spark to > read in all

Re: unit testing in spark

2017-04-11 Thread Steve Loughran
(sorry sent an empty reply by accident) Unit testing is one of the easiest ways to isolate problems in an internal class, things you can get wrong. But: time spent writing unit tests is time *not* spent writing integration tests. Which biases me towards the integration. What I do find is

Re: What is the best way to run a scheduled spark batch job on AWS EC2 ?

2017-04-11 Thread Steve Loughran
On Fri, 7 Apr 2017 at 16:17, Gourav Sengupta wrote: Hi Steve, Why would you ever do that? You are suggesting the use of a CI tool as a workflow and orchestration engine. Regards, Gourav Sengupta On Fri, Apr 7, 2017 at 4:

Re: Does Spark uses its own HDFS client?

2017-04-07 Thread Steve Loughran
On 7 Apr 2017, at 15:32, Alvaro Brandon > wrote: I was going through the SparkContext.textFile() and I was wondering at that point does Spark communicates with HDFS. Since when you download Spark binaries you also specify the Hadoop

Re: What is the best way to run a scheduled spark batch job on AWS EC2 ?

2017-04-07 Thread Steve Loughran
If you have Jenkins set up for some CI workflow, that can do scheduled builds and tests. Works well if you can do some build test before even submitting it to a remote cluster On 7 Apr 2017, at 10:15, Sam Elamin > wrote: Hi Shyla You

Re: httpclient conflict in spark

2017-03-30 Thread Steve Loughran
On 29 Mar 2017, at 14:42, Arvind Kandaswamy > wrote: Hello, I am getting the following error. I get this error when trying to use AWS S3. This appears to be a conflict with httpclient. AWS S3 comes with httplient-4.5.2.jar. I am not

Re: Spark and continuous integration

2017-03-14 Thread Steve Loughran
On 13 Mar 2017, at 13:24, Sam Elamin > wrote: Hi Jorn Thanks for the prompt reply, really we have 2 main concerns with CD, ensuring tests pasts and linting on the code. I'd add "providing diagnostics when tests fail", which is a

Re: Wrong runtime type when using newAPIHadoopFile in Java

2017-03-06 Thread Steve Loughran
On 6 Mar 2017, at 12:30, Nira Amit > wrote: And it's very difficult if it's doing unexpected things. All serialisations do unexpected things. Nobody understands them. Sorry

Re: Custom log4j.properties on AWS EMR

2017-02-26 Thread Steve Loughran
try giving a resource of a file in the JAR, e.g add a file "log4j-debugging.properties into the jar, and give a config option of -Dlog4j.configuration=/log4j-debugging.properties (maybe also try without the "/") On 26 Feb 2017, at 16:31, Prithish
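The suggestion above, wired into spark-submit options as a sketch; the properties file name is the hypothetical one from the thread, and applying it to both driver and executor is an assumption:

```python
# point log4j at a resource bundled inside the application JAR
log4j_opt = "-Dlog4j.configuration=log4j-debugging.properties"
submit_conf = {
    "spark.driver.extraJavaOptions": log4j_opt,
    "spark.executor.extraJavaOptions": log4j_opt,
}
for key, value in sorted(submit_conf.items()):
    print(f"--conf {key}={value}")
```

Per the thread, if the resource is not found, retry with a leading `/` on the resource path.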

Re: Get S3 Parquet File

2017-02-25 Thread Steve Loughran
On 24 Feb 2017, at 07:47, Femi Anthony > wrote: Have you tried reading using s3n which is a slightly older protocol ? I'm not sure how compatible s3a is with older versions of Spark. I would absolutely not use s3n with a 1.2 GB file. There is a

Re: Will Spark ever run the same task at the same time

2017-02-20 Thread Steve Loughran
> On 16 Feb 2017, at 18:34, Ji Yan wrote: > > Dear spark users, > > Is there any mechanism in Spark that does not guarantee the idempotent > nature? For example, for stragglers, the framework might start another task > assuming the straggler is slow while the straggler is

Re: fault tolerant dataframe write with overwrite

2017-02-14 Thread Steve Loughran
On 14 Feb 2017, at 11:12, Mendelson, Assaf > wrote: I know how to get the filesystem, the problem is that this means using Hadoop directly so if in the future we change to something else (e.g. S3) I would need to rewrite the code. well,

Re: How to measure IO time in Spark over S3

2017-02-13 Thread Steve Loughran
Hadoop 2.8's s3a does a lot more metrics here, most of which you can find on HDP-2.5 if you can grab those JARs. Everything comes out as hadoop JMX metrics, also readable & aggregatable through a call to FileSystem.getStorageStatistics Measuring IO time isn't something picked up, because it's

Re: using an alternative slf4j implementation

2017-02-06 Thread Steve Loughran
> On 6 Feb 2017, at 11:06, Mendelson, Assaf wrote: > > Found some questions (without answers) and I found some jira > (https://issues.apache.org/jira/browse/SPARK-4147 and > https://issues.apache.org/jira/browse/SPARK-14703), however they do not solve > the issue. >

Re: spark 2.02 error when writing to s3

2017-01-28 Thread Steve Loughran
On 27 Jan 2017, at 23:17, VND Tremblay, Paul > wrote: Not sure what you mean by "a consistency layer on top." Any explanation would be greatly appreciated! Paul netflix's s3mper: https://github.com/Netflix/s3mper EMR consistency:

Re: spark 2.02 error when writing to s3

2017-01-27 Thread Steve Loughran
…Paul Tremblay Analytics Specialist THE BOSTON CONSULTING GROUP Tel. + ▪ Mobile + _ From: Neil Jonkers Sent: Friday, January 20, 2017 11:39 AM To: Steve Loughran; VND Tremblay, Paul…

Re: spark 2.02 error when writing to s3

2017-01-20 Thread Steve Loughran
AWS S3 is eventually consistent: even after something is deleted, a LIST/GET call may show it. You may be seeing that effect; even after the DELETE has got rid of the files, a listing sees something there, And I suspect the time it takes for the listing to "go away" will depend on the total

Re: "Unable to load native-hadoop library for your platform" while running Spark jobs

2017-01-20 Thread Steve Loughran
On 19 Jan 2017, at 10:59, Sean Owen > wrote: It's a message from Hadoop libs, not Spark. It can be safely ignored. It's just saying you haven't installed the additional (non-Apache-licensed) native libs that can accelerate some operations. This is

Re: Anyone has any experience using spark in the banking industry?

2017-01-20 Thread Steve Loughran
> On 18 Jan 2017, at 21:50, kant kodali wrote: > > Anyone has any experience using spark in the banking industry? I have couple > of questions. > 2. How can I make spark cluster highly available across multi datacenter? Any > pointers? That's not, AFAIK, been a design

Re: Spark GraphFrame ConnectedComponents

2017-01-06 Thread Steve Loughran
On 5 Jan 2017, at 21:10, Ankur Srivastava wrote: Yes I did try it out and it chooses the local file system as my checkpoint location starts with s3n:// I am not sure how I can make it load the S3FileSystem. set fs.default.name to

Re: Spark Read from Google store and save in AWS s3

2017-01-06 Thread Steve Loughran
On 5 Jan 2017, at 20:07, Manohar Reddy > wrote: Hi Steve, Thanks for the reply and below is follow-up help needed from you. Do you mean we can set up two native file system to single sparkcontext ,so then based on urls

Re: Spark Read from Google store and save in AWS s3

2017-01-05 Thread Steve Loughran
On 5 Jan 2017, at 09:58, Manohar753 > wrote: Hi All, Using spark is interoperability communication between two clouds(Google,AWS) possible. in my use case i need to take Google store as input to spark and do some

Re: How to load a big csv to dataframe in Spark 1.6

2017-01-03 Thread Steve Loughran
On 31 Dec 2016, at 16:09, Raymond Xie > wrote: Hello Felix, I followed the instruction and ran the command: > $SPARK_HOME/bin/spark-shell --packages com.databricks:spark-csv_2.11:1.5.0 and I received the following error message:

Re: Question about Spark and filesystems

2017-01-03 Thread Steve Loughran
On 18 Dec 2016, at 19:50, joa...@verona.se wrote: Since each Spark worker node needs to access the same files, we have tried using Hdfs. This worked, but there were some oddities making me a bit uneasy. For dependency hell reasons I compiled a modified Spark, and this

Re: Gradle dependency problem with spark

2016-12-16 Thread Steve Loughran
FWIW, although the underlying Hadoop declared guava dependency is pretty low, everything in org.apache.hadoop is set up to run against later versions. It just sticks with the old one to avoid breaking anything downstream which does expect a low version number. See HADOOP-10101 for the ongoing

Re: Handling Exception or Control in spark dataframe write()

2016-12-16 Thread Steve Loughran
> On 14 Dec 2016, at 18:10, bhayat wrote: > > Hello, > > I am writing my RDD into parquet format but what i understand that write() > method is still experimental and i do not know how i will deal with possible > exceptions. > > For example: > >

Re: Few questions on reliability of accumulators value.

2016-12-15 Thread Steve Loughran
On 12 Dec 2016, at 19:57, Daniel Siegmann > wrote: Accumulators are generally unreliable and should not be used. The answer to (2) and (4) is yes. The answer to (3) is both. Here's a more in-depth explanation:

Re: WARN util.NativeCodeLoader

2016-12-12 Thread Steve Loughran
> On 8 Dec 2016, at 06:38, baipeng wrote: > > Hi ALL > > I’m new to Spark.When I execute spark-shell, the first line is as > follows > WARN util.NativeCodeLoader: Unable to load native-hadoop library for your > platform... using builtin-java classes where applicable. >

Re: Access multiple cluster

2016-12-05 Thread Steve Loughran
if the remote filesystem is visible from the other, then a different HDFS value, e.g. hdfs://analytics:8000/historical/ can be used for reads & writes, even if your defaultFS (the one where you get max performance) is, say, hdfs://processing:8000/ -performance will be slower, in both directions

Re: What benefits do we really get out of colocation?

2016-12-03 Thread Steve Loughran
On 3 Dec 2016, at 09:16, Manish Malhotra > wrote: thanks for sharing number as well ! Now a days even network can be with very high throughput, and might out perform the disk, but as Sean mentioned data on network will have

Re: Spark ignoring partition names without equals (=) separator

2016-11-29 Thread Steve Loughran
On 29 Nov 2016, at 05:19, Prasanna Santhanam wrote: On Mon, Nov 28, 2016 at 4:39 PM, Steve Loughran wrote: irrespective of naming, know that deep directory trees are p…

Re: Spark ignoring partition names without equals (=) separator

2016-11-28 Thread Steve Loughran
irrespective of naming, know that deep directory trees are performance killers when listing files on s3 and setting up jobs. You might actually be better off having them in the same directory and using a pattern like 2016-03-11-* as the pattern to find files. On 28 Nov 2016, at 04:18,
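The flat-layout idea above in miniature: date-stamped names in a single directory, selected with the quoted glob pattern. File names are illustrative.

```python
from fnmatch import fnmatch

# flat listing: the date is in the file name, not in nested directories
files = [
    "2016-03-10-part-0000.csv",
    "2016-03-11-part-0000.csv",
    "2016-03-11-part-0001.csv",
]

# one shallow LIST plus a name pattern replaces walking a deep tree
selected = [f for f in files if fnmatch(f, "2016-03-11-*")]
print(selected)  # → the two 2016-03-11 files
```

On S3 this matters because each directory level in a `year=/month=/day=` tree costs extra LIST calls during job setup.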

Re: Third party library

2016-11-27 Thread Steve Loughran
On 27 Nov 2016, at 02:55, kant kodali > wrote: I would say instead of LD_LIBRARY_PATH you might want to use java.library.path in the following way java -Djava.library.path=/path/to/my/library or pass java.library.path along with spark-submit

Re: convert local tsv file to orc file on distributed cloud storage(openstack).

2016-11-25 Thread Steve Loughran
…upload problem from the conversion On 24 Nov 2016, at 18:44, vr spark wrote: Hi, The source file i have is on local machine and its pretty huge like 150 gb. How to go about it? On Sun, Nov 20, 2016 at 8:52 AM, Steve Loughran

Re: How to write a custom file system?

2016-11-22 Thread Steve Loughran
On 21 Nov 2016, at 17:26, Samy Dindane > wrote: Hi, I'd like to extend the file:// file system and add some custom logic to the API that lists files. I think I need to extend FileSystem or LocalFileSystem from org.apache.hadoop.fs, but I am not sure

Re: convert local tsv file to orc file on distributed cloud storage(openstack).

2016-11-20 Thread Steve Loughran
On 19 Nov 2016, at 17:21, vr spark > wrote: Hi, I am looking for scala or python code samples to covert local tsv file to orc file and store on distributed cloud storage(openstack). So, need these 3 samples. Please suggest. 1. read tsv 2.

Re: Run spark with hadoop snapshot

2016-11-19 Thread Steve Loughran
I'd recommend you build a full spark release with the new hadoop version; you should have built that locally earlier the same day (so that ivy/maven pick up the snapshot) dev/make-distribution.sh -Pyarn,hadoop-2.7,hive -Dhadoop.version=2.9.0-SNAPSHOT; > On 18 Nov 2016, at 19:31, lminer

Re: Long-running job OOMs driver process

2016-11-18 Thread Steve Loughran
On 18 Nov 2016, at 14:31, Keith Bourgoin > wrote: We thread the file processing to amortize the cost of things like getting files from S3. Define cost here: actual $ amount, or merely time to read the data? If it's read times, you should really be

Re: Any with S3 experience with Spark? Having ListBucket issues

2016-11-18 Thread Steve Loughran
On 16 Nov 2016, at 22:34, Edden Burrow > wrote: Anyone dealing with a lot of files with spark? We're trying s3a with 2.0.1 because we're seeing intermittent errors in S3 where jobs fail and saveAsText file fails. Using pyspark. How many

Re: Delegation Token renewal in yarn-cluster

2016-11-04 Thread Steve Loughran
On 4 Nov 2016, at 01:37, Marcelo Vanzin wrote: On Thu, Nov 3, 2016 at 3:47 PM, Zsolt Tóth wrote: What is the purpose of the delegation token renewal (the one that is done automatically

Re: sandboxing spark executors

2016-11-04 Thread Steve Loughran
> On 4 Nov 2016, at 06:41, blazespinnaker wrote: > > Is there a good method / discussion / documentation on how to sandbox a spark > executor? Assume the code is untrusted and you don't want it to be able to > make un validated network connections or do unvalidated

Re: Spark 2.0 with Hadoop 3.0?

2016-10-29 Thread Steve Loughran
On 27 Oct 2016, at 23:04, adam kramer > wrote: Is the version of Spark built for Hadoop 2.7 and later only for 2.x releases? Is there any reason why Hadoop 3.0 is a non-starter for use with Spark 2.0? The version of aws-sdk in 3.0 actually works for

Re: Spark security

2016-10-27 Thread Steve Loughran
On 13 Oct 2016, at 14:40, Mendelson, Assaf > wrote: Hi, We have a spark cluster and we wanted to add some security for it. I was looking at the documentation (in http://spark.apache.org/docs/latest/security.html) and had some questions.

Re: spark infers date to be timestamp type

2016-10-27 Thread Steve Loughran
CSV type inference isn't really ideal: it does a full scan of a file to determine this; you are doubling the amount of data you need to read. Unless you are just exploring files in your notebook, I'd recommend doing it once, getting the schema from it then using that as the basis for the code
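A toy illustration of the point above, in stdlib Python: type inference has to scan the data, so do it once on a sample and keep the resulting schema for reuse rather than re-inferring on every read. The data and the two-type system (int vs string) are made up for the sketch.

```python
import csv, io

def infer_schema(rows):
    """Guess a column type per header by scanning all sampled values."""
    def kind(values):
        try:
            [int(v) for v in values]
            return "int"
        except ValueError:
            return "string"
    header, *data = rows
    cols = list(zip(*data))          # column-wise view of the sample
    return {name: kind(col) for name, col in zip(header, cols)}

sample = list(csv.reader(io.StringIO("id,name\n1,ann\n2,bob\n")))
schema = infer_schema(sample)
print(schema)  # → {'id': 'int', 'name': 'string'}
```

In Spark terms: capture `df.schema` from one inferred read during exploration, then pass an explicit schema on production reads to skip the second full scan of the file.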

Re: Writing to Parquet Job turns to wait mode after even completion of job

2016-10-24 Thread Steve Loughran
On 24 Oct 2016, at 20:32, Cheng Lian wrote: On 10/22/16 6:18 AM, Steve Loughran wrote: ... On Sat, Oct 22, 2016 at 3:41 AM, Cheng Lian wrote: What versio…

Re: Getting the IP address of Spark Driver in yarn-cluster mode

2016-10-24 Thread Steve Loughran
On 24 Oct 2016, at 19:34, Masood Krohy > wrote: Hi everyone, Is there a way to set the IP address/hostname that the Spark Driver is going to be running on when launching a program through spark-submit in yarn-cluster mode (PySpark

Re: Issues with reading gz files with Spark Streaming

2016-10-24 Thread Steve Loughran
…debug the class: org.apache.spark.sql.execution.streaming.FileStreamSource On 22 October 2016 at 15:14, Steve Loughran wrote: > On 21 Oct 2016, at 15:53, Nkechi Achara…

Re: Writing to Parquet Job turns to wait mode after even completion of job

2016-10-22 Thread Steve Loughran
On 22 Oct 2016, at 00:48, Chetan Khatri > wrote: Hello Cheng, Thank you for response. I am using spark 1.6.1, i am writing around 350 gz parquet part files for single table. Processed around 180 GB of Data using Spark. Are you writing

Re: Issues with reading gz files with Spark Streaming

2016-10-22 Thread Steve Loughran
> On 21 Oct 2016, at 15:53, Nkechi Achara wrote: > > Hi, > > I am using Spark 1.5.0 to read gz files with textFileStream, but when new > files are dropped in the specified directory. I know this is only the case > with gz files as when i extract the file into the

Re: Why the json file used by sparkSession.read.json must be a valid json object per line

2016-10-20 Thread Steve Loughran
> On 19 Oct 2016, at 21:46, Jakob Odersky wrote: > > Another reason I could imagine is that files are often read from HDFS, > which by default uses line terminators to separate records. > > It is possible to implement your own hdfs delimiter finder, however > for arbitrary
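The line-delimited layout being discussed, in miniature: one complete JSON object per line, so a reader that splits on line terminators (as HDFS record readers do by default) can parse every record independently and in parallel. The sample data is made up.

```python
import json

# JSON-lines input: each line is a self-contained JSON document
jsonl = '{"id": 1, "name": "a"}\n{"id": 2, "name": "b"}\n'

# records can be parsed line by line, with no shared parser state,
# which is why splits can be processed on different workers
records = [json.loads(line) for line in jsonl.splitlines() if line]
print(len(records), records[0]["name"])  # → 2 a
```

A single pretty-printed JSON document, by contrast, cannot be cut at an arbitrary line boundary, so it cannot be split across workers this way.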

Re: spark with kerberos

2016-10-19 Thread Steve Loughran
On 19 Oct 2016, at 00:18, Michael Segel > wrote: (Sorry sent reply via wrong account.. ) Steve, Kinda hijacking the thread, but I promise its still on topic to OP’s issue.. ;-) Usually you will end up having a local Kerberos set up

Re: About Error while reading large JSON file in Spark

2016-10-19 Thread Steve Loughran
On 18 Oct 2016, at 10:58, Chetan Khatri > wrote: Dear Xi shen, Thank you for getting back to question. The approach i am following are as below: I have MSSQL server as Enterprise data lack. 1. run Java jobs and generated JSON files,

Re: About Error while reading large JSON file in Spark

2016-10-18 Thread Steve Loughran
On 18 Oct 2016, at 08:43, Chetan Khatri > wrote: Hello Community members, I am getting error while reading large JSON file in spark, the underlying read code can't handle more than 2^31 bytes in a single line: if (bytesConsumed >

Re: spark with kerberos

2016-10-18 Thread Steve Loughran
…that you would be collecting the data to a client (edge node) before pushing it out to the secured cluster. Does that make sense? On Oct 14, 2016, at 1:32 PM, Steve Loughran wrote: On 13 Oct 2016, at 10:50, dbolshak…

Re: spark with kerberos

2016-10-14 Thread Steve Loughran
On 13 Oct 2016, at 10:50, dbolshak > wrote: Hello community, We've a challenge and no ideas how to solve it. The problem, Say we have the following environment: 1. `cluster A`, the cluster does not use kerberos and we use it as a
