Re: Accessing AWS S3 in Frankfurt (v4 only - AWS4-HMAC-SHA256)

2015-03-21 Thread Steve Loughran
1. make sure your secret key doesn't have a / in it. If it does, generate a new key. 2. jets3t and hadoop JAR versions need to be in sync; jets3t 0.9.0 was picked up in Hadoop 2.4 and not AFAIK 3. Hadoop 2.6 has a new S3 client, s3a, which is compatible with s3n data. It uses the AWS toolkit
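The practical impact of a `/` in a secret key shows up when credentials are embedded in the filesystem URL, as jets3t-era s3n:// usage often did. A small illustration with made-up keys:

```python
from urllib.parse import urlparse

# Made-up credentials: the secret contains a '/'
access_key = "AKIAEXAMPLE"
bad_secret = "abc/defsecret"

# jets3t-era usage often embedded credentials in the s3n:// URL itself
url = "s3n://%s:%s@bucket/data/file.csv" % (access_key, bad_secret)
parsed = urlparse(url)

# The '/' inside the secret is read as a path separator, so the
# credential portion of the URL is cut short at the slash:
print(parsed.netloc)  # AKIAEXAMPLE:abc  -- secret truncated
print(parsed.path)    # /defsecret@bucket/data/file.csv
```

Regenerating the key until it contains no slash, as suggested above, sidesteps the parsing ambiguity entirely.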

Re: Invalid ContainerId ... Caused by: java.lang.NumberFormatException: For input string: e04

2015-03-24 Thread Steve Loughran
On 24 Mar 2015, at 02:10, Marcelo Vanzin van...@cloudera.com wrote: This happens most probably because the Spark 1.3 you have downloaded is built against an older version of the Hadoop libraries than those used by CDH, and those libraries cannot parse the container IDs generated by CDH.

Re: upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems

2015-03-26 Thread Steve Loughran
On 25 Mar 2015, at 21:54, roni roni.epi...@gmail.com wrote: Is there any way that I can install the new one and remove the previous version? I installed spark 1.3 on my EC2 master and set the spark home to the new one. But when I start the spark-shell I get -

Re: Single threaded laptop implementation beating a 128 node GraphX cluster on a 1TB data set (128 billion nodes) - What is a use case for GraphX then? when is it worth the cost?

2015-03-30 Thread Steve Loughran
Note that even the Facebook four degrees of separation paper went down to a single machine running WebGraph (http://webgraph.di.unimi.it/) for the final steps, after running jobs in their Hadoop cluster to build the dataset for that final operation. The computations were performed on a

Re: Single threaded laptop implementation beating a 128 node GraphX cluster on a 1TB data set (128 billion nodes) - What is a use case for GraphX then? when is it worth the cost?

2015-03-30 Thread Steve Loughran
On 30 Mar 2015, at 13:27, jay vyas jayunit100.apa...@gmail.com wrote: Just the same as spark was disrupting the hadoop ecosystem by changing the assumption that you can't rely on memory in distributed analytics...now maybe we are challenging the assumption

Re: RDD resiliency -- does it keep state?

2015-03-28 Thread Steve Loughran
It's worth adding that there's no guarantee that re-evaluated work would be on the same host as before, and in the case of node failure, it is not guaranteed to be elsewhere. This means things that depend on host-local information are going to generate different numbers even if there are no

Re: Instantiating/starting Spark jobs programmatically

2015-04-21 Thread Steve Loughran
On 21 Apr 2015, at 17:34, Richard Marscher rmarsc...@localytics.com wrote: - There are System.exit calls built into Spark as of now that could kill your running JVM. We have shadowed some of the most offensive bits within our own application to work around this.

Re: spark-ec2 s3a filesystem support and hadoop versions

2015-04-24 Thread Steve Loughran
S3a isn't ready for production use on anything below Hadoop 2.7.0. I say that as the person who mentored in all the patches for it between Hadoop 2.6 and 2.7; you need everything in https://issues.apache.org/jira/browse/HADOOP-11571 in your code. Hadoop 2.6.0 doesn't have any of the HADOOP-11571

Re: Multiple HA spark clusters managed by 1 ZK cluster?

2015-04-22 Thread Steve Loughran
the key thing would be to use different ZK paths for each cluster. You shouldn't need more than 2 ZK quorums even for a large (few-thousand-node) Hadoop cluster: one for the HA bits of the infrastructure (HDFS, YARN) and one for the applications to abuse. It's easy for apps using ZK to stick
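The standalone-HA settings involved look roughly like this (hostnames and paths are made up); the only thing that must differ between clusters sharing a quorum is the ZK directory:

```properties
# spark-defaults.conf for cluster A (hypothetical hosts/paths)
spark.deploy.recoveryMode   ZOOKEEPER
spark.deploy.zookeeper.url  zk1:2181,zk2:2181,zk3:2181
spark.deploy.zookeeper.dir  /spark-cluster-a

# Cluster B points at the same quorum but gets its own path:
# spark.deploy.zookeeper.dir  /spark-cluster-b
```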

Re: directory loader in windows

2015-04-27 Thread Steve Loughran
This is a Hadoop-side stack trace: it looks like the code is trying to get the filesystem permissions by running %HADOOP_HOME%\bin\WINUTILS.EXE ls -F and something is triggering a null pointer exception. There isn't any HADOOP JIRA with this specific stack trace in it, so it's not a

Re: How to debug Spark on Yarn?

2015-04-28 Thread Steve Loughran
On 27 Apr 2015, at 07:51, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: Spark 1.3 1. View stderr/stdout from executor from Web UI: when the job is running I figured out the executor that I am supposed to see, and those two links show 4 special characters on browser. 2.

Re: TwitterUtils on Windows

2015-05-19 Thread Steve Loughran
On 19 May 2015, at 03:08, Justin Pihony justin.pih...@gmail.com wrote: 15/05/18 22:03:14 INFO Executor: Fetching http://192.168.56.1:49752/jars/twitter4j-media-support-3.0.3.jar with timestamp 1432000973058 15/05/18 22:03:14 INFO Utils: Fetching

Re: EVent generation

2015-05-12 Thread Steve Loughran
I think you may want to try emailing things to the storm users list, not the spark one On 11 May 2015, at 15:42, Tyler Mitchell tyler.mitch...@actian.com wrote: I've had good success with splunk generator. https://github.com/coccyx/eventgen/blob/master/README.md

Re: Spark's Guava pieces cause exceptions in non-trivial deployments

2015-05-18 Thread Steve Loughran
On 16 May 2015, at 04:39, Anton Brazhnyk anton.brazh...@genesys.com wrote: For me it wouldn’t help I guess, because those newer classes would still be loaded by different classloader. What did work for me with 1.3.1 – removing of those classes from Spark’s jar

Re: Forbidded : Error Code: 403

2015-05-15 Thread Steve Loughran
On 15 May 2015, at 21:20, Mohammad Tariq donta...@gmail.com wrote: Thank you Ayan and Ted for the prompt response. It isn't working with s3n either. And I am able to download the file. In fact I am able to read the same file using s3 API without any issue. sounds like an S3n

Re: Yarn application state monitor thread dying on IOException

2015-04-11 Thread Steve Loughran
On 10 Apr 2015, at 13:40, Lorenz Knies m...@l1024.org wrote: i would consider it a bug, that the “Yarn application state monitor” thread dies on an, i think even expected (at least in the java methods called further down the stack), exception. What do you think? Is it a problem, that we

Re: Processing Large Images in Spark?

2015-04-07 Thread Steve Loughran
On 6 Apr 2015, at 23:05, Patrick Young patrick.mckendree.yo...@gmail.com wrote: does anyone have any thoughts on storing a really large raster in HDFS? Seems like if I just dump the image into HDFS as is, it'll get stored in blocks all across the

Re: need info on Spark submit on yarn-cluster mode

2015-04-08 Thread Steve Loughran
This means the spark workers exited with code 15; probably nothing YARN related itself (unless there are classpath-related problems). Have a look at the logs of the app/container via the resource manager. You can also increase the time that logs get kept on the nodes themselves to something

Re: Access several s3 buckets, with credentials containing /

2015-06-05 Thread Steve Loughran
On 5 Jun 2015, at 08:03, Pierre B pierre.borckm...@realimpactanalytics.com wrote: Hi list! My problem is quite simple. I need to access several S3 buckets, using different credentials: ``` val c1 = sc.textFile(s3n://[ACCESS_KEY_ID1:SECRET_ACCESS_KEY1]@bucket1/file.csv).count val c2
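For reference, later Hadoop releases (2.8+) added per-bucket s3a credentials, which is the clean fix for this kind of multi-bucket setup; it was not available in the Hadoop versions discussed in this thread. A sketch of the core-site.xml entries, with hypothetical bucket names and placeholder keys:

```xml
<!-- Hadoop 2.8+ per-bucket s3a credentials; bucket names are hypothetical -->
<property>
  <name>fs.s3a.bucket.bucket1.access.key</name>
  <value>ACCESS_KEY_ID1</value>
</property>
<property>
  <name>fs.s3a.bucket.bucket1.secret.key</name>
  <value>SECRET_ACCESS_KEY1</value>
</property>
<property>
  <name>fs.s3a.bucket.bucket2.access.key</name>
  <value>ACCESS_KEY_ID2</value>
</property>
<property>
  <name>fs.s3a.bucket.bucket2.secret.key</name>
  <value>SECRET_ACCESS_KEY2</value>
</property>
```

This also avoids embedding secrets in URLs, which is where the slash-in-secret problem above comes from.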

Re: Spark 1.3.1 On Mesos Issues.

2015-06-05 Thread Steve Loughran
On 2 Jun 2015, at 00:14, Dean Wampler deanwamp...@gmail.com wrote: It would be nice to see the code for MapR FS Java API, but my google foo failed me (assuming it's open source)... I know that MapRFS is closed source, don't know about the java JAR. Why not ask

Re: Spark Job always cause a node to reboot

2015-06-05 Thread Steve Loughran
On 4 Jun 2015, at 15:59, Chao Chen kandy...@gmail.com wrote: But when I try to run the Pagerank from HiBench, it always cause a node to reboot during the middle of the work for all scala, java, and python versions. But works fine with the MapReduce version from the same benchmark. do

Re: FileOutputCommitter deadlock 1.3.1

2015-06-09 Thread Steve Loughran
On 8 Jun 2015, at 15:55, Richard Marscher rmarsc...@localytics.com wrote: Hi, we've been seeing occasional issues in production with the FileOutputCommitter reaching a deadlock situation. We are writing our data to S3 and currently have speculation enabled. What

Re: Spark 1.4 History Server - HDP 2.2

2015-06-21 Thread Steve Loughran
On 20 Jun 2015, at 17:37, Ashish Soni asoni.le...@gmail.com wrote: Can any one help i am getting below error when i try to start the History Server I do not see any org.apache.spark.deploy.yarn.history package inside the assembly jar not sure how to get that

Re: Abount Jobs UI in yarn-client mode

2015-06-20 Thread Steve Loughran
On 19 Jun 2015, at 16:48, Sea 261810...@qq.com wrote: Hi, all: I run spark on yarn, I want to see the Jobs UI http://ip:4040/, but it redirect to http://${yarn.ip}/proxy/application_1428110196022_924324/ which can not be found. Why? Anyone can help? whenever you point

Re: Web UI vs History Server Bugs

2015-06-20 Thread Steve Loughran
On 17 Jun 2015, at 19:10, jcai jonathon@yale.edu wrote: Hi, I am running this on Spark stand-alone mode. I find that when I examine the web UI, a couple bugs arise: 1. There is a discrepancy between the number denoting the duration of the application when I run the history server

Re: Problem attaching to YARN

2015-06-22 Thread Steve Loughran
On 22 Jun 2015, at 04:08, Shawn Garbett shawn.garb...@gmail.com wrote: 2015-06-21 11:03:22,029 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Container [pid=39288,containerID=container_1434751301309_0015_02_01]

Re: Exception in thread main java.lang.NoSuchMethodError: com.google.common.base.Stopwatch.elapsedMillis()J

2015-06-25 Thread Steve Loughran
you are using a guava version on the classpath which your version of Hadoop can't handle. Try a Guava version no later than 15, or build spark against Hadoop 2.7.0 On 24 Jun 2015, at 19:03, maxdml max...@cs.duke.edu wrote: Exception in thread main java.lang.NoSuchMethodError:

Re: Recent spark sc.textFile needs hadoop for folders?!?

2015-06-26 Thread Steve Loughran
On 26 Jun 2015, at 09:29, Ashic Mahtab as...@live.com wrote: Thanks for the replies, guys. Is this a permanent change as of 1.3, or will it go away at some point? Don't blame the spark team, complain to the hadoop team for being slow to embrace the java 1.7 APIs for

Re: Spark launching without all of the requested YARN resources

2015-06-24 Thread Steve Loughran
On 24 Jun 2015, at 05:55, canan chen ccn...@gmail.com wrote: Why do you want it to wait until all the resources are ready? Make it start as early as possible should make it complete earlier and increase the utilization of resources On Tue, Jun 23, 2015 at 10:34 PM, Arun

Re: s3 - Can't make directory for path

2015-06-23 Thread Steve Loughran
On 23 Jun 2015, at 00:09, Danny kont...@dannylinden.de wrote: hi, have you tested s3://ww-sandbox/name_of_path/ instead of s3://ww-sandbox/name_of_path + make sure the bucket is there already. Hadoop s3 clients don't currently handle that step or have you tested adding your file

Re: [Spark 1.4.0] java.lang.UnsupportedOperationException: Not implemented by the TFS FileSystem implementation

2015-06-13 Thread Steve Loughran
That's the Tachyon FS there, which appears to be missing a method override. On 12 Jun 2015, at 19:58, Peter Haumer phau...@us.ibm.com wrote: Exception in thread main java.lang.UnsupportedOperationException: Not implemented by the TFS FileSystem implementation at

Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-12 Thread Steve Loughran
These are both really good posts: you should try and get them in to the documentation. with anything implementing dynamicness, there are some fun problems (a) detecting the delays in the workflow. There's some good ideas here (b) deciding where to address it. That means you need to monitor the

Re: Spark standalone mode and kerberized cluster

2015-06-16 Thread Steve Loughran
On 15 Jun 2015, at 15:43, Borja Garrido Bear kazebo...@gmail.com wrote: I tried running the job in a standalone cluster and I'm getting this: java.io.IOException: Failed on local exception: java.io.IOException: org.apache.hadoop.security.AccessControlException:

Re: Is it possible to see Spark jobs on MapReduce job history ? (running Spark on YARN cluster)

2015-06-12 Thread Steve Loughran
For that you need SPARK-1537 and the patch to go with it It is still the spark web UI, it just hands off storage and retrieval of the history to the underlying Yarn timeline server, rather than through the filesystem. You'll get to see things as they go along too. If you do want to try it,

Re: s3 bucket access/read file

2015-07-01 Thread Steve Loughran
s3a uses amazon's own libraries; it's tested against frankfurt too. you have to view s3a support in Hadoop 2.6 as beta-release: it works, with some issues. Hadoop 2.7.0+ has it all working now, though you are left with the task of getting hadoop-aws and the amazon JAR onto your classpath via the

Re: Accessing Kerberos Secured HDFS Resources from Spark on Mesos

2015-06-29 Thread Steve Loughran
On 29 Jun 2015, at 14:18, Dave Ariens dari...@blackberry.com wrote: I'd like to toss out another idea that doesn't involve a complete end-to-end Kerberos implementation. Essentially, have the driver authenticate to Kerberos, instantiate a Hadoop file system, and

Re: Accessing Kerberos Secured HDFS Resources from Spark on Mesos

2015-06-28 Thread Steve Loughran
On 27 Jun 2015, at 07:56, Tim Chen t...@mesosphere.io wrote: Does YARN provide the token through that env variable you mentioned? Or how does YARN do this? Roughly: 1. client-side launcher creates the delegation tokens and adds them as byte[] data to the

Re: thrift-server does not load jars files (Azure HDInsight)

2015-07-03 Thread Steve Loughran
On Thu, Jul 2, 2015 at 7:38 AM, Daniel Haviv daniel.ha...@veracity-group.com wrote: Hi, I'm trying to start the thrift-server and passing it azure's blob storage jars but I'm failing on : Caused by: java.io.IOException: No FileSystem for scheme: wasb

Re: IPv6 support

2015-06-30 Thread Steve Loughran
On 24 Jun 2015, at 18:56, Kevin Liu kevin...@fb.com wrote: Continuing this thread beyond standalone - onto clusters, does anyone have experience successfully running any Spark cluster on IPv6 only (not dual stack) machines? More companies are moving to IPv6 and some such

Re: Spark standalone mode and kerberized cluster

2015-06-11 Thread Steve Loughran
That's spark on YARN in Kerberos In Spark 1.3 you can submit work to a Kerberized Hadoop cluster; once the tokens you passed up with your app submission expire (~72 hours) your job can't access HDFS any more. That's been addressed in Spark 1.4, where you can now specify a kerberos keytab for
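The Spark 1.4 mechanism mentioned above is driven by two spark-submit options; a sketch of the invocation, with hypothetical principal, keytab path, class and jar names:

```shell
# Spark 1.4+ on YARN: supply a keytab so the application can
# re-acquire Kerberos tokens past the ~72 hour expiry
spark-submit \
  --master yarn-cluster \
  --principal etl-user@EXAMPLE.COM \
  --keytab /etc/security/keytabs/etl-user.keytab \
  --class com.example.MyJob \
  myjob.jar
```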

Re: Setting up Spark/flume/? to Ingest 10TB from FTP

2015-08-17 Thread Steve Loughran
with the right ftp client JAR on your classpath (I forget which), you can use ftp:// as a source for a hadoop FS operation. you may even be able to use it as an input for some spark (non-streaming) job directly. On 14 Aug 2015, at 14:11, Varadhan, Jawahar

Re: NO Cygwin Support in bin/spark-class in Spark 1.4.0

2015-07-28 Thread Steve Loughran
there's a spark-submit.cmd file for windows. Does that work? On 27 Jul 2015, at 21:19, Proust GZ Feng pf...@cn.ibm.com wrote: Hi, Spark Users Looks like Spark 1.4.0 cannot work with Cygwin due to the removing of Cygwin support in bin/spark-class The changeset is

Re: Spark SQL support for Hive 0.14

2015-08-05 Thread Steve Loughran
` and `double` were now types, and UNION had evolved.) From: Michael Armbrust [mailto:mich...@databricks.com] Sent: Tuesday, August 4, 2015 11:53 PM To: Steve Loughran ste...@hortonworks.com Cc: Ishwardeep Singh ishwardeep.si...@impetus.co.in

Re: Unable to load native-hadoop library for your platform

2015-08-05 Thread Steve Loughran
, Steve Loughran ste...@hortonworks.com wrote: Think it may be needed on Windows, certainly if you start trying to work with local files. On 4 Aug 2015, at 00:34, Sean Owen so...@cloudera.com wrote: It won't affect you if you're not actually running Hadoop. But it's mainly things like Snappy

Re: Is it possible to disable AM page proxy in Yarn client mode?

2015-08-03 Thread Steve Loughran
the reason that redirect is there is for security reasons; in a kerberos enabled cluster the RM proxy does the authentication, then forwards the requests to the running application. There's no obvious way to disable it in the spark application master, and I wouldn't recommend doing this anyway,

Re: Standalone Cluster Local Authentication

2015-08-03 Thread Steve Loughran
On 3 Aug 2015, at 10:05, MrJew kouz...@gmail.com wrote: Hello, Similar to other cluster systems e.g Zookeeper, Actually, Zookeeper supports SASL authentication of your Kerberos tokens. https://cwiki.apache.org/confluence/display/ZOOKEEPER/Zookeeper+and+SASL Hazelcast. Spark has the

Re: TCP/IP speedup

2015-08-02 Thread Steve Loughran
On 1 Aug 2015, at 18:26, Ruslan Dautkhanov dautkha...@gmail.com wrote: If your network is bandwidth-bound, you'll see setting jumbo frames (MTU 9000) may increase bandwidth up to ~20%.

Re: How to increase parallelism of a Spark cluster?

2015-08-02 Thread Steve Loughran
On 2 Aug 2015, at 13:42, Sujit Pal sujitatgt...@gmail.com wrote: There is no additional configuration on the external Solr host from my code, I am using the default HttpClient provided by HttpSolrServer. According to the Javadocs, you can pass in a HttpClient

Re: apache-spark 1.3.0 and yarn integration and spring-boot as a container

2015-07-30 Thread Steve Loughran
you need to fix your configuration so that the resource manager hostname/URL is set... that address there is the 'listen on any port' path On 30 Jul 2015, at 10:47, Nirav Patel npa...@xactlycorp.com wrote: 15/07/29 11:19:26 INFO client.RMProxy: Connecting to

Re: Getting java.net.BindException when attempting to start Spark master on EC2 node with public IP

2015-07-28 Thread Steve Loughran
try looking at the causes and steps here https://wiki.apache.org/hadoop/BindException On 28 Jul 2015, at 09:22, Wayne Song wayne.e.s...@gmail.com wrote: I made this message with the Nabble web interface; I included the stack trace there, but I guess it didn't

Re: Accessing S3 files with s3n://

2015-08-11 Thread Steve Loughran
On 10 Aug 2015, at 20:17, Akshat Aranya aara...@gmail.com wrote: Hi Jerry, Akhil, Thanks for your help. With s3n, the entire file is downloaded even while just creating the RDD with sqlContext.read.parquet(). It seems like even just opening and closing the

Re: Unable to load native-hadoop library for your platform

2015-08-04 Thread Steve Loughran
Think it may be needed on Windows, certainly if you start trying to work with local files. On 4 Aug 2015, at 00:34, Sean Owen so...@cloudera.com wrote: It won't affect you if you're not actually running Hadoop. But it's mainly things like Snappy/LZO compression which are implemented as

Re: Spark SQL support for Hive 0.14

2015-08-04 Thread Steve Loughran
Spark 1.3.1 and 1.4 only support Hive 0.13. Spark 1.5 is going to be released against Hive 1.2.1; it'll skip Hive 0.14 support entirely and go straight to the currently supported Hive release. See SPARK-8064 for the gory details On 3 Aug 2015, at 23:01, Ishwardeep Singh

Re: Specifying the role when launching an AWS spark cluster using spark_ec2

2015-08-06 Thread Steve Loughran
There's no support for IAM roles in the s3n:// client code in Apache Hadoop (HADOOP-9384); Amazon's modified EMR distro may have it. The s3a filesystem adds it; this is ready for production use in Hadoop 2.7.1+ (implicitly HDP 2.3; CDH 5.4 has cherrypicked the relevant patches.) I don't

Re: Help accessing protected S3

2015-07-23 Thread Steve Loughran
On 23 Jul 2015, at 10:47, Greg Anderson gregory.ander...@familysearch.org wrote: So when I go to ~/ephemeral-hdfs/bin/hadoop and check its version, it says Hadoop 2.0.0-cdh4.2.0. If I run pyspark and use the s3a address, things should work, right? What am I missing? And thanks so

Re: Help accessing protected S3

2015-07-23 Thread Steve Loughran
On 23 Jul 2015, at 01:50, Ewan Leith ewan.le...@realitymine.com wrote: I think the standard S3 driver used in Spark from the Hadoop project (S3n) doesn't support IAM role based authentication. However, S3a should support it. If you're running Hadoop 2.6 via the spark-ec2 scripts (I'm

Re: Local Spark talking to remote HDFS?

2015-08-25 Thread Steve Loughran
I wouldn't try to play with forwarding/tunnelling; it's always hard to work out what ports get used everywhere, and the services like hostname==URL in paths. Can't you just set up an entry in the windows /etc/hosts file? It's what I do (on Unix) to talk to VMs On 25 Aug 2015, at 04:49, Dino

Re: NoSuchMethodException : com.google.common.io.ByteStreams.limit

2015-10-23 Thread Steve Loughran
just try dropping in that Jar. Hadoop core ships with an out of date guava JAR to avoid breaking old code downstream, but 2.7.x is designed to work with later versions too (i.e. it has moved off any of the now-removed methods). See https://issues.apache.org/jira/browse/HADOOP-10101 for the

Re: Spark 1.5.1+Hadoop2.6 .. unable to write to S3 (HADOOP-12420)

2015-10-22 Thread Steve Loughran
On 22 Oct 2015, at 15:12, Ashish Shrowty wrote: I understand that there is some incompatibility with the API between Hadoop 2.6/2.7 and Amazon AWS SDK where they changed a signature of

Re: Spark_1.5.1_on_HortonWorks

2015-10-22 Thread Steve Loughran
On 22 Oct 2015, at 02:47, Ajay Chander wrote: Thanks for your time. I have followed your inputs and downloaded "spark-1.5.1-bin-hadoop2.6" on one of the nodes, say node1. And when I did a Pi test everything seems to be working fine, except that

Re: Loading Files from HDFS Incurs Network Communication

2015-10-26 Thread Steve Loughran
On 26 Oct 2015, at 09:28, Jinfeng Li wrote: Replication factor is 3 and we have 18 data nodes. We check HDFS webUI, data is evenly distributed among 18 machines. Every block in HDFS (usually 64, 128 or 256 MB) is distributed across three machines, meaning 3
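The arithmetic behind this is easy to sketch (block size and file size assumed for illustration):

```python
import math

def block_copies(file_size_mb, block_mb=128, replication=3):
    """Number of HDFS blocks and total replica count for a file."""
    blocks = math.ceil(file_size_mb / block_mb)
    return blocks, blocks * replication

# A 1 GB file with 128 MB blocks and replication factor 3:
blocks, copies = block_copies(1024)
print(blocks, copies)  # 8 blocks, 24 block replicas spread over the cluster
```

So even a single file's reads fan out across many machines, which is why some network traffic when loading from HDFS is expected.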

Re: "Failed to bind to" error with spark-shell on CDH5 and YARN

2015-10-24 Thread Steve Loughran
On 24 Oct 2015, at 00:46, Lin Zhao wrote: I have a spark on YARN deployed using Cloudera Manager 5.4. The installation went smoothly. But when I try to run spark-shell I get a long list of exceptions saying "failed to bind to: /public_ip_of_host:0"

Re: "Failed to bind to" error with spark-shell on CDH5 and YARN

2015-10-24 Thread Steve Loughran
better wiki entry https://wiki.apache.org/hadoop/BindException

Re: Building spark-1.5.x and MQTT

2015-10-28 Thread Steve Loughran
On 28 Oct 2015, at 13:19, Bob Corsaro wrote: Has anyone successfully built this? I'm trying to determine if there is a defect in the source package or something strange about my environment. I get a FileNotFound exception on MQTTUtils.class during the build of the

Re: Loading Files from HDFS Incurs Network Communication

2015-10-26 Thread Steve Loughran
e APIs, it's not seamless to glue it up with the spark context metric registry On Mon, Oct 26, 2015 at 11:14 AM, Steve Loughran ste...@hortonworks.com wrote: On 26 Oct 2015, at 09:28, Jinfeng Li liji...@gmail.com

Re: how to run unit test for specific component only

2015-11-13 Thread Steve Loughran
try: mvn test -pl sql -DwildcardSuites=org.apache.spark.sql -Dtest=none On 12 Nov 2015, at 03:13, weoccc wrote: Hi, I am wondering how to run unit test for specific spark component only. mvn test -DwildcardSuites="org.apache.spark.sql.*"

Re: HiveServer2 Thrift OOM

2015-11-13 Thread Steve Loughran
looks suspiciously like some thrift transport unmarshalling problem, THRIFT-2660 Spark 1.5 uses hive 1.2.1; it should have the relevant thrift JAR too. Otherwise, you could play with thrift JAR versions yourself —maybe it will work, maybe not... On 13 Nov 2015, at 00:29, Yana Kadiyska

Re: Spark Job is getting killed after certain hours

2015-11-17 Thread Steve Loughran
On 17 Nov 2015, at 15:39, Nikhil Gs wrote: Hello Everyone, Firstly, thank you so much for the response. In our cluster, we are using Spark 1.3.0 and our cluster version is CDH 5.4.1. Yes, we are also using Kerberos in our cluster

Re: Protobuff 3.0 for Spark

2015-11-05 Thread Steve Loughran
On 5 Nov 2015, at 00:12, Lan Jiang wrote: I have used protobuf 3 successfully with Spark on CDH 5.4, even though Hadoop itself comes with protobuf 2.5. I think the steps apply to HDP too. You need to do the following Protobuf.jar has been so brittle in the past

Re: Spark reading from S3 getting very slow

2015-11-05 Thread Steve Loughran
On 5 Nov 2015, at 02:03, Younes Naguib wrote: Hi all, I’m reading large text files from s3, sizes between 30GB and 40GB. Every stage runs in 8-9s, except the last 32, which jump to 1-2 min for some reason! Here is my

Re: Save data to different S3

2015-10-30 Thread Steve Loughran
On 30 Oct 2015, at 18:05, William Li wrote: Thanks for your response. My secret has a slash (/) in it so it didn’t work… that's a recurrent problem with the hadoop/java s3 clients. Keep trying to regenerate a secret until you get one that works

Re: Problem installing Sparck on Windows 8

2015-10-14 Thread Steve Loughran
On 14 Oct 2015, at 20:56, Marco Mistroni wrote: 15/10/14 20:52:35 WARN : Your hostname, MarcoLaptop resolves to a loopback/non-reachable address: fe80:0:0:0:c5ed:a66d:9d95:5caa%wlan2, but we couldn't find any external IP address!

Re: s3a file system and spark deployment mode

2015-10-16 Thread Steve Loughran
On 15 Oct 2015, at 19:04, Scott Reynolds wrote: List, Right now we build our spark jobs with the s3a hadoop client. We do this because our machines are only allowed to use IAM access to the s3 store. We can build our jars with the s3a filesystem and the

Re: spark-shell (1.5.1) not starting cleanly on Windows.

2015-10-21 Thread Steve Loughran
you've hit this https://wiki.apache.org/hadoop/WindowsProblems the next version of hadoop will fail with a more useful message, including that wiki link On 21 Oct 2015, at 00:36, Renato Perini wrote: java.lang.RuntimeException:
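The usual remedy on that wiki page is to install winutils.exe and point HADOOP_HOME at it before launching spark-shell; a sketch, with a hypothetical install path:

```bat
:: Windows cmd: winutils.exe must live under %HADOOP_HOME%\bin
:: (C:\hadoop is a made-up location -- use wherever you unpacked it)
set HADOOP_HOME=C:\hadoop
set PATH=%HADOOP_HOME%\bin;%PATH%
```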

Re: Notification on Spark Streaming job failure

2015-10-07 Thread Steve Loughran
On 7 Oct 2015, at 06:28, Krzysztof Zarzycki wrote: Hi Vikram, So you give up using yarn-cluster mode of launching Spark jobs, is that right? AFAIK when using yarn-cluster mode, the launch process (spark-submit) monitors job running on YARN,

Re: spark multi tenancy

2015-10-07 Thread Steve Loughran
On 7 Oct 2015, at 09:26, Dominik Fries wrote: Hello Folks, We want to deploy several spark projects and want to use a unique project user for each of them. Only the project user should start the spark application and have the corresponding packages

Re: [Spark on YARN] Multiple Auxiliary Shuffle Service Versions

2015-10-06 Thread Steve Loughran
On 6 Oct 2015, at 01:23, Andrew Or wrote: Both the history server and the shuffle service are backward compatible, but not forward compatible. This means as long as you have the latest version of history server / shuffle service running in

Re: spark multi tenancy

2015-10-07 Thread Steve Loughran
On 7 Oct 2015, at 11:06, ayan guha wrote: Can queues also be used to separate workloads? yes; that's standard practice. Different YARN queues can have different maximum memory & CPU, and you can even tag queues as "pre-emptible", so more

Re: Spark thrift service and Hive impersonation.

2015-10-06 Thread Steve Loughran
different SQL engine underneath Thanks, Jagat Singh On Wed, Sep 30, 2015 at 9:37 PM, Vinay Shukla vinayshu...@gmail.com wrote: Steve is right, the Spark thrift server does not propagate end-user identity downstream yet. On Wednesday, September

Re: How to compile Spark with customized Hadoop?

2015-10-10 Thread Steve Loughran
During development, I'd recommend giving Hadoop a version ending with -SNAPSHOT, and building spark with maven, as mvn knows to refresh the snapshot every day. you can do this in hadoop with mvn versions:set -DnewVersion=2.7.0.stevel-SNAPSHOT if you are working on hadoop branch-2 or trunk directly, they

Re: Problem installing Sparck on Windows 8

2015-10-13 Thread Steve Loughran
On 12 Oct 2015, at 23:11, Marco Mistroni wrote: Hi all, i have downloaded spark-1.5.1-bin-hadoop.2.4 and extracted it on my machine, but when i go to the \bin directory and invoke spark-shell i get the following exception. Could anyone

Re: S3 vs HDFS

2015-07-12 Thread Steve Loughran
On 11 Jul 2015, at 19:20, Aaron Davidson ilike...@gmail.com wrote: Note that if you use multi-part upload, each part becomes 1 block, which allows for multiple concurrent readers. One would typically use fixed-size block sizes which align with Spark's default HDFS
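The part-per-block idea can be sketched as a byte-range calculation (the sizes below are illustrative, not a recommendation):

```python
def part_ranges(object_size, part_size):
    """Byte ranges of a multi-part upload; each part can be read concurrently."""
    ranges = []
    start = 0
    while start < object_size:
        end = min(start + part_size, object_size)
        ranges.append((start, end))
        start = end
    return ranges

# A 300 MB object uploaded in 128 MB parts -> three independently readable parts
mb = 1024 * 1024
ranges = part_ranges(300 * mb, 128 * mb)
print(len(ranges))  # 3
```

Picking a part size equal to the split size Spark will use keeps each reader aligned with one part, which is the "align with Spark's default" point above.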

Re: YARN Labels

2015-11-17 Thread Steve Loughran
One of our clusters runs on AWS with a portion of the nodes being spot nodes. We would like to force the application master not to run on spot nodes. For what ever reason, application master is not able to recover in cases the node where it was running suddenly disappears, which is the case

Re: Spark Job is getting killed after certain hours

2015-11-17 Thread Steve Loughran
On 17 Nov 2015, at 02:00, Nikhil Gs wrote: Hello Team, Below is the error which we are facing in our cluster after 14 hours of starting the spark submit job. Not able to understand the issue and why it's facing the below error

Re: spark-submit stuck and no output in console

2015-11-17 Thread Steve Loughran
On 17 Nov 2015, at 09:54, Kayode Odeyemi wrote: Initially, I submitted 2 jobs to the YARN cluster which were running for 2 days and suddenly stopped. Nothing in the logs shows the root cause. 48 hours is one of those kerberos warning times (as is

Re: Data Security on Spark-on-HDFS

2015-08-31 Thread Steve Loughran
On 31 Aug 2015, at 11:02, Daniel Schulz wrote: Hi guys, In a nutshell: does Spark check and respect user privileges when reading/writing data? Yes, in a locked down YARN cluster, until your tokens expire. I am curious about the data security

Re: Too many open files issue

2015-09-02 Thread Steve Loughran
On 31 Aug 2015, at 19:49, Sigurd Knippenberg wrote: I know I can adjust the max open files allowed by the OS but I'd rather fix the underlying issue. Bumping up the OS handle limits is step #1 of installing a hadoop cluster

Re: Resource allocation issue - is it possible to submit a new job in existing application under a different user?

2015-09-03 Thread Steve Loughran
If it's running the thrift server from hive, it's got a SQL API for you to connect to... On 3 Sep 2015, at 17:03, Dhaval Patel wrote: I am accessing a shared cluster mode Spark environment. However, there is an existing application

Re: Too many open files issue

2015-09-02 Thread Steve Loughran
The Spark job doesn't release its file handles until the end of the job instead of doing that while my loop iterates. Sigurd On Wed, Sep 2, 2015 at 4:33 AM, Steve Loughran ste...@hortonworks.com wrote: On 31 Aug 2015, at 19:49, Sigurd Knippenberg

Re: How to read files from S3 from Spark local when there is a http proxy

2015-09-09 Thread Steve Loughran
s3a:// has a proxy option https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html s3n: apparently gets set up differently, though I've never tested it http://stackoverflow.com/questions/20241953/hadoop-distcp-to-s3-behind-http-proxy On 8 Sep 2015, at 13:51, tariq
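The s3a proxy options referenced in that documentation look like this in core-site.xml (host and port values are placeholders):

```xml
<!-- s3a proxy settings from the hadoop-aws module; values are hypothetical -->
<property>
  <name>fs.s3a.proxy.host</name>
  <value>proxy.example.com</value>
</property>
<property>
  <name>fs.s3a.proxy.port</name>
  <value>8080</value>
</property>
```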

Re: change the spark version

2015-09-13 Thread Steve Loughran
On 12 Sep 2015, at 09:14, Sean Owen wrote: This is a question for the CDH list. CDH 5.4 has Spark 1.3, and 5.5 has 1.5. The best thing is to update CDH as a whole if you can. However it's pretty simple to just run a newer Spark assembly as a YARN app. Don't

Re: Change protobuf version or any other third party library version in Spark application

2015-09-15 Thread Steve Loughran
On 15 Sep 2015, at 05:47, Lan Jiang wrote: Hi, there, I am using Spark 1.4.1. The protobuf 2.5 is included by Spark 1.4.1 by default. However, I would like to use Protobuf 3 in my spark application so that I can use some new features such as Map

Re: hdfs-ha on mesos - odd bug

2015-09-15 Thread Steve Loughran
On 15 Sep 2015, at 08:55, Adrian Bridgett wrote: Hi Sam, in short, no, it's a traditional install as we plan to use spot instances and didn't want price spikes to kill off HDFS. We're actually doing a bit of a hybrid, using spot instances for the mesos

Re: Spark thrift service and Hive impersonation.

2015-09-30 Thread Steve Loughran
On 30 Sep 2015, at 03:24, Mohammed Guller wrote: Does each user need to start their own thrift server to use it? No. One of the benefits of the Spark Thrift Server is that it allows multiple users to share a single SparkContext. Most likely,

Re: automatic start of streaming job on failure on YARN

2015-10-02 Thread Steve Loughran
On 1 Oct 2015, at 16:52, Adrian Tanase wrote: This happens automatically as long as you submit with cluster mode instead of client mode (e.g. ./spark-submit --master yarn-cluster …) The property you mention would help right after that, although

Re: WAL on S3

2015-09-23 Thread Steve Loughran
On 23 Sep 2015, at 14:56, Michal Čizmazia wrote: To get around the fact that flush does not work in S3, my custom WAL implementation stores a separate S3 object for each WriteAheadLog.write call. Do you see any gotchas with this approach?

Re: WAL on S3

2015-09-23 Thread Steve Loughran
On 23 Sep 2015, at 07:10, Tathagata Das wrote: Responses inline. On Tue, Sep 22, 2015 at 8:35 PM, Michal Čizmazia wrote: Can checkpoints be stored to S3 (via S3/S3A Hadoop URL)? Yes. Because

Re: AWS_CREDENTIAL_FILE

2015-09-22 Thread Steve Loughran
On 22 Sep 2015, at 10:40, Akhil Das wrote: or you can set it in the environment as: export AWS_ACCESS_KEY_ID= export AWS_SECRET_ACCESS_KEY= I didn't think the Hadoop code looked at those. There aren't any references to the env

Re: WAL on S3

2015-09-18 Thread Steve Loughran
On 17 Sep 2015, at 21:40, Tathagata Das wrote: Actually, the current WAL implementation (as of Spark 1.5) does not work with S3 because S3 does not support flushing. Basically, the current implementation assumes that after write + flush, the data is immediately

Re: Exception on save s3n file (1.4.1, hadoop 2.6)

2015-09-25 Thread Steve Loughran
On 25 Sep 2015, at 03:35, Zhang, Jingyu wrote: I got the following exception when I run JavaPairRDD.values().saveAsTextFile("s3n://bucket"); Can anyone help me out? thanks 15/09/25 12:24:32 INFO SparkContext: Successfully stopped
