Re: Belief propagation algorithm is open sourced
Nice! I am especially interested in Bayesian networks, which are only one of the many models that can be expressed in a factor graph representation. Do you do Bayesian network learning at scale (parameters and structure) with latent variables? Are you using publicly available tools for that? Which ones? LibDAI, which created the supported format, "supports parameter learning of conditional probability tables by Expectation Maximization" according to its documentation. Is it your reference tool?

Bertrand

On Thu, Dec 15, 2016 at 5:21 AM, Bryan Cutler wrote:
> I'll check it out, thanks for sharing Alexander!
>
> On Dec 13, 2016 4:58 PM, "Ulanov, Alexander" wrote:
>
>> Dear Spark developers and users,
>>
>> HPE has open sourced an implementation of the belief propagation (BP)
>> algorithm for Apache Spark. BP is a popular message-passing algorithm for
>> performing inference in probabilistic graphical models. It provides exact
>> inference for graphical models without loops. While inference for graphical
>> models with loops is approximate, in practice it has been shown to work well.
>> The implementation is generic and operates on the factor graph representation
>> of graphical models. It handles factors of any order, and variable domains of
>> any size. It is implemented with Apache Spark GraphX, and thus can scale to
>> large models. Further, it supports computations in log scale for
>> numerical stability. Large-scale applications of BP include fraud detection
>> in banking transactions and malicious site detection in computer networks.
>>
>> Source code: https://github.com/HewlettPackard/sandpiper
>>
>> Best regards, Alexander
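To make the discussion concrete, here is a minimal sum-product sketch on a tiny loop-free factor graph (a three-variable chain A -- f1 -- B -- f2 -- C), independent of the Spark/GraphX implementation. The factor values and variable sizes are made up for illustration; on a tree like this, message passing recovers the exact marginal, which the brute-force enumeration confirms.

```python
import numpy as np

# Illustrative pairwise factors (values are arbitrary assumptions).
f1 = np.array([[0.9, 0.1],
               [0.2, 0.8]])   # factor over (A, B)
f2 = np.array([[0.7, 0.3],
               [0.4, 0.6]])   # factor over (B, C)

# Messages from the leaf variables A and C are uniform (no local evidence).
msg_a_to_f1 = np.ones(2)
msg_c_to_f2 = np.ones(2)

# Factor-to-variable messages: multiply in the incoming message and sum out
# the other variable of the factor.
msg_f1_to_b = f1.T @ msg_a_to_f1      # sums over A
msg_f2_to_b = f2 @ msg_c_to_f2        # sums over C

# Belief at B = product of incoming messages, normalized.
belief_b = msg_f1_to_b * msg_f2_to_b
belief_b /= belief_b.sum()

# Brute-force check: enumerate every joint assignment (a, b, c).
joint = np.einsum('ab,bc->abc', f1, f2)
brute_b = joint.sum(axis=(0, 2))
brute_b /= brute_b.sum()

print(np.allclose(belief_b, brute_b))  # exact inference on a tree
```

The same two message types (variable-to-factor and factor-to-variable) are what a GraphX-based implementation exchanges along the edges of the factor graph, typically in log scale for stability.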
SparkContext#cancelJobGroup: is it safe? Who got burned? Who is alive?
Hi,

I am wondering about the safety of the *SparkContext#cancelJobGroup* method, which should allow stopping specific (i.e. not all) jobs inside a Spark context. There is a big disclaimer in the javadoc ( https://spark.apache.org/docs/1.6.0/api/java/org/apache/spark/SparkContext.html#setJobGroup(java.lang.String,%20java.lang.String,%20boolean) ):

> If interruptOnCancel is set to true for the job group, then job
> cancellation will result in Thread.interrupt() being called on the job's
> executor threads. This is useful to help ensure that the tasks are actually
> stopped in a timely manner, but is off by default due to HDFS-1208, where
> HDFS may respond to Thread.interrupt() by marking nodes as dead.

I have two main questions:
1. What is the expected behavior if it is not interrupted on cancel? I am especially curious about the YARN case with HDFS, but any info is welcome.
2. Who is or was using *interruptOnCancel*? Did you get burned? Is it still working without any incident?

Thanks in advance for any info, feedback and war stories.

Bertrand Dechoux
Re: Replacing Esper with Spark Streaming?
The big question would be which features of Esper you are using. Esper is a CEP solution. I doubt that Spark Streaming can do everything Esper does without any development. Spark (Streaming) is more of a general-purpose platform. http://www.espertech.com/products/esper.php But I would be glad to be proven wrong (which would also imply EsperTech is dead, which I also doubt...)

Bertrand

On Mon, Sep 14, 2015 at 2:31 PM, Todd Nist wrote:
> Stratio offers a CEP implementation based on Spark Streaming and the
> Siddhi CEP engine. I have not used the below, but they may be of some
> value to you:
>
> http://stratio.github.io/streaming-cep-engine/
>
> https://github.com/Stratio/streaming-cep-engine
>
> HTH.
>
> -Todd
>
> On Sun, Sep 13, 2015 at 7:49 PM, Otis Gospodnetić <
> otis.gospodne...@gmail.com> wrote:
>
>> Hi,
>>
>> I'm wondering if anyone has attempted to replace Esper with Spark
>> Streaming, or if anyone thinks Spark Streaming is/isn't a good tool for the
>> (CEP) job?
>>
>> We are considering Akka or Spark Streaming as possible Esper replacements
>> and would appreciate any input from people who tried to do that with either
>> of them.
>>
>> Thanks,
>> Otis
>> --
>> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
>> Solr & Elasticsearch Support * http://sematext.com/
Re: EOFException when I list all files in hdfs directory
Well, anyone can open an account on apache jira and post a new ticket/enhancement/issue/bug... Bertrand Dechoux On Fri, Jul 25, 2014 at 4:07 PM, Sparky gullo_tho...@bah.com wrote: Thanks for the suggestion. I can confirm that my problem is I have files with zero bytes. It's a known bug and is marked as a high priority: https://issues.apache.org/jira/browse/SPARK-1960 -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/EOFException-when-I-list-all-files-in-hdfs-directory-tp10648p10651.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Spark deployed by Cloudera Manager
> Is there any documentation from Cloudera on how to run Spark apps on CDH-Manager-deployed Spark?

Asking the Cloudera community would be a good idea: http://community.cloudera.com/ In the end, only Cloudera will quickly fix issues with CDH...

Bertrand Dechoux

On Wed, Jul 23, 2014 at 9:28 AM, Debasish Das debasish.da...@gmail.com wrote:

I found the issue... If you use Spark git and generate the assembly jar, then org.apache.hadoop.io.Writable.class is packaged with it. If you use the assembly jar that ships with CDH in /opt/cloudera/parcels/CDH/lib/spark/assembly/lib/spark-assembly_2.10-0.9.0-cdh5.0.2-hadoop2.3.0-cdh5.0.2.jar, they don't put org.apache.hadoop.io.Writable.class in it... That's weird... If I can run the Spark app using bare-bone Java, I am sure it will run with Ooyala's job server as well...

On Wed, Jul 23, 2014 at 12:15 AM, buntu buntu...@gmail.com wrote:

If you need to run Spark apps through Hue, see if Ooyala's job server helps: http://gethue.com/get-started-with-spark-deploy-spark-server-and-compute-pi-from-your-web-browser/

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-deployed-by-Cloudera-Manager-tp10472p10474.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Large scale ranked recommendation
And you might want to apply clustering first. It is likely that not every user and every item is unique.

Bertrand Dechoux

On Fri, Jul 18, 2014 at 9:13 AM, Nick Pentreath nick.pentre...@gmail.com wrote:

It is very true that making predictions in batch for all 1 million users against the 10k items will be quite onerous in terms of computation. I have run into this issue too in making batch predictions. Some ideas:

1. Do you really need to generate recommendations for each user in batch? How are you serving these recommendations? In most cases, you only need to make recs when a user is actively interacting with your service or product etc. Doing it all in batch tends to be a big waste of computation resources. In our system, for example, we are serving them in real time (as a user arrives at a web page, say, our customer hits our API for recs), so we only generate the rec at that time. You can take a look at Oryx for this ( https://github.com/cloudera/oryx ); though it does not yet support Spark, you may be able to save the model into the correct format in HDFS and have Oryx read the data.

2. If you do need to make the recs in batch, then I would suggest:
(a) because you have few items, collect the item vectors and form a matrix.
(b) broadcast that matrix
(c) do a mapPartitions on the user vectors. Form a user matrix from the vectors in each partition (maybe create quite a few partitions to make each user matrix not too big)
(d) do a value call on the broadcasted item matrix
(e) now for each partition you have the (small) item matrix and a (larger) user matrix. Do a matrix multiply and you end up with a (U x I) matrix with the scores for each user in the partition. Because you are using BLAS here, it will be significantly faster than individually computed dot products
(f) sort the scores for each user and take the top K
(g) save or collect and do whatever with the scores

3. In conjunction with (2), you can try throwing more resources at the problem too.

If you access the underlying Breeze vectors (I think the toBreeze method is private, so you may have to re-implement it), you can do all this using Breeze (e.g. concatenating vectors to make matrices, iterating and whatnot).

Hope that helps
Nick

On Fri, Jul 18, 2014 at 1:17 AM, m3.sharma sharm...@umn.edu wrote:

Yes, that's what prediction should be doing: taking dot products or applying the sigmoid function for each user-item pair. For 1 million users and 10K items there are 10 billion pairs.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Large-scale-ranked-recommendation-tp10098p10107.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
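Nick's steps (a)-(g) can be sketched outside Spark with NumPy: a plain Python loop over blocks stands in for mapPartitions, and the item-factor matrix stands in for the broadcast variable. All sizes and names below are illustrative assumptions, not taken from any actual ALS model.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, rank, k = 1000, 50, 10, 5

user_factors = rng.normal(size=(n_users, rank))
item_factors = rng.normal(size=(n_items, rank))   # small: broadcast this in Spark

def top_k_per_partition(user_block, item_matrix, k):
    # (e) one matrix multiply per partition instead of per-pair dot products
    scores = user_block @ item_matrix.T           # (users_in_block, n_items)
    # (f) top-k item indices per user, best first
    top = np.argpartition(-scores, k, axis=1)[:, :k]
    order = np.argsort(-np.take_along_axis(scores, top, axis=1), axis=1)
    return np.take_along_axis(top, order, axis=1)

# (c)+(g): stand-in for mapPartitions over the user-factor RDD
recs = np.vstack([top_k_per_partition(block, item_factors, k)
                  for block in np.array_split(user_factors, 8)])
print(recs.shape)  # (1000, 5)
```

The point of the blocked multiply is that `user_block @ item_matrix.T` goes through BLAS, so scoring a whole partition costs far less than 10 billion individually computed dot products.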
Re: How does Spark speculation prevent duplicated work?
I haven't looked at the implementation, but what you would do with any filesystem is write to a file inside the workspace directory of the task. Then only the attempt of the task that should be kept performs a move to the final path; the other attempts are simply discarded. For most filesystems (and that's the case for HDFS), a 'move' is a very simple and fast action because only the full path/name of the file changes, not its content or where this content is physically stored.

Speculative execution happens in Hadoop MapReduce, and Spark has the same concept. As long as you apply functions with no side effects (i.e. the only impact is the returned result), you just need to not take into account results from additional attempts of the same task/operator.

Bertrand Dechoux

On Tue, Jul 15, 2014 at 9:34 PM, Andrew Ash and...@andrewash.com wrote:

Hi Nan,

Great digging in -- that makes sense to me for when a job is producing some output handled by Spark like a .count or .distinct or similar.

For the other part of the question, I'm also interested in side effects like an HDFS disk write. If one task is writing to an HDFS path and another task starts up, wouldn't it also attempt to write to the same path? How is that de-conflicted?

On Tue, Jul 15, 2014 at 3:02 PM, Nan Zhu zhunanmcg...@gmail.com wrote:

Hi, Mingyu,

According to my understanding, Spark processes the result generated from each partition by passing it to resultHandler (SparkContext.scala L1056). This resultHandler usually just puts the result in a driver-side array, the length of which is always partitions.size. This design effectively ensures that the actions are idempotent. E.g. count is implemented as

def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum

Even if the task in the partition is executed in duplicate, the result put in the array is the same. At the same time, I think the Spark implementation ensures that the operation applied on the return value of SparkContext.runJob will not be triggered when the duplicate tasks finish. When a task is finished, the code execution path is TaskSetManager.handleSuccessfulTask -> DAGScheduler.taskEnded. In taskEnded, it triggers the CompletionEvent message handler, where DAGScheduler checks if (!job.finished(rt.outputId)), and rt.outputId is the partition id. So even if the duplicate task invokes a CompletionEvent message, it will find that job.finished(rt.outputId) has become true eventually.

Maybe I am wrong…just went through the code roughly, welcome to correct me.

Best,

--
Nan Zhu

On Tuesday, July 15, 2014 at 1:55 PM, Mingyu Kim wrote:

Hi all,

I was curious about the details of Spark speculation. So, my understanding is that, when "speculated" tasks are newly scheduled on other machines, the original tasks are still running until the entire stage completes. This seems to leave some room for duplicated work because some Spark actions are not idempotent. For example, it may count a partition twice in case of RDD.count, or may write a partition to HDFS twice in case of RDD.save*(). How does it prevent this kind of duplicated work?

Mingyu

Attachments:
- smime.p7s
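The "write to a task workspace, then move the winner to the final path" idea discussed above (what Hadoop's FileOutputCommitter does on HDFS) can be sketched with a local filesystem. All paths and names here are made up for illustration; on HDFS the rename is likewise a cheap metadata-only operation.

```python
import os
import shutil
import tempfile

def run_attempt(job_dir, task_id, attempt_id, data):
    # Every attempt writes into its own attempt-scoped temporary directory,
    # so concurrent attempts of the same task never touch the same file.
    attempt_dir = os.path.join(job_dir, "_temporary",
                               f"task{task_id}_attempt{attempt_id}")
    os.makedirs(attempt_dir, exist_ok=True)
    with open(os.path.join(attempt_dir, "part-00000"), "w") as f:
        f.write(data)
    return attempt_dir

def commit(job_dir, task_id, attempt_dir):
    # Commit = move the winning attempt's directory to the final path.
    # A rename changes only the path, not where the bytes live.
    final_dir = os.path.join(job_dir, f"task{task_id}")
    os.rename(attempt_dir, final_dir)
    return final_dir

job_dir = tempfile.mkdtemp()
a0 = run_attempt(job_dir, 0, 0, "original attempt")
a1 = run_attempt(job_dir, 0, 1, "speculative attempt")  # same task, no conflict

final = commit(job_dir, 0, a1)   # the attempt that commits first wins
shutil.rmtree(a0)                # the losing attempt is simply discarded
with open(os.path.join(final, "part-00000")) as f:
    print(f.read())              # speculative attempt
```

This is why a speculative duplicate writing "to the same path" is not actually a conflict: the final path only ever receives one attempt's output.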
Re: Does the MLlib Naive Bayes implementation incorporate Laplace smoothing?
A patch proposal on the Apache JIRA for Spark? https://issues.apache.org/jira/browse/SPARK/

Bertrand

On Thu, Jul 10, 2014 at 2:37 PM, Rahul Bhojwani rahulbhojwani2...@gmail.com wrote:

And also there is a small bug in the implementation, as I mentioned earlier. This is the first time I am reporting a bug, so I just wanted to ask: does your name appear somewhere, or do you get something, after correcting/reporting a bug? So that I can mention it in my profile/resume (as I am moving towards my final year undergrad and will be sitting for company interviews). If there is something like that which I can mention, then I can go for the procedure you mentioned earlier. Otherwise I can simply mention the error here on the mailing list itself. It's a small bug (a bug according to me). I'm not trying to be selfish. It's just that if I get something that can help make my profile look stronger, then I shouldn't miss it at this stage.

Thanks,

On Thu, Jul 10, 2014 at 5:54 PM, Rahul Bhojwani rahulbhojwani2...@gmail.com wrote:

Ya, thanks. I can see that lambda is used as the parameter.

On Thu, Jul 10, 2014 at 1:40 PM, Sean Owen so...@cloudera.com wrote:

Have a look at the code maybe? https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala

Yes, there is a smoothing parameter, and yes, from the looks of it, it is simply additive / Laplace smoothing. It's been in there for a while.

On Thu, Jul 10, 2014 at 6:55 AM, Rahul Bhojwani rahulbhojwani2...@gmail.com wrote:

The discussion is in context of Spark 0.9.1. Does the MLlib Naive Bayes implementation incorporate Laplace smoothing? Or any other smoothing? Or doesn't it incorporate any smoothing? Please inform.
Thanks, -- Rahul K Bhojwani 3rd Year B.Tech Computer Science and Engineering National Institute of Technology, Karnataka -- Rahul K Bhojwani 3rd Year B.Tech Computer Science and Engineering National Institute of Technology, Karnataka -- Rahul K Bhojwani 3rd Year B.Tech Computer Science and Engineering National Institute of Technology, Karnataka
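For readers following the thread: the additive (Laplace) smoothing that Sean points at in NaiveBayes.scala can be sketched with NumPy. This mirrors the role of the `lambda` parameter, but the counts and the helper name are illustrative, not Spark's code.

```python
import numpy as np

def smoothed_log_theta(counts, lam=1.0):
    # Additive (Laplace) smoothing of multinomial Naive Bayes parameters:
    # theta[c][j] = log((count[c][j] + lam) / (total[c] + lam * n_features))
    counts = np.asarray(counts, dtype=float)
    n_features = counts.shape[1]
    num = counts + lam
    den = counts.sum(axis=1, keepdims=True) + lam * n_features
    return np.log(num / den)

# Made-up per-class feature counts; the zeros would give log(0) = -inf
# without smoothing, which is exactly what lambda > 0 prevents.
counts = np.array([[3, 0, 7],
                   [0, 5, 5]])
theta = smoothed_log_theta(counts, lam=1.0)
print(np.exp(theta).sum(axis=1))  # each class's probabilities sum to 1
```

With `lam=0` the zero counts would produce infinite log-probabilities, so any document containing an unseen feature would be assigned probability zero for that class.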
Pig 0.13, Spark, Spork
Hi, I was wondering what the state of the Pig+Spark initiative is, now that the execution engine of Pig is pluggable. Granted, it was done in order to use Tez, but could it be used by Spark? I know about a 'theoretical' project called Spork, but I don't know of any stable and maintained version of it. Regards, Bertrand Dechoux
Re: Shark vs Impala
For the second question, I would say it is mainly because the projects do not have the same aim. Impala does have a cost-based optimizer and predicate propagation capability, which is natural because it interprets SQL-like queries. In the realm of relational databases, it is often not a good idea to compete against the optimizer; that is of course also true for 'BigData'.

Bertrand

On Sun, Jun 22, 2014 at 1:32 PM, Flavio Pompermaier pomperma...@okkam.it wrote:

Hi folks, I was looking at the benchmark provided by Cloudera at http://blog.cloudera.com/blog/2014/05/new-sql-choices-in-the-apache-hadoop-ecosystem-why-impala-continues-to-lead/ . Is it real that Shark cannot execute some queries if you don't have enough memory? And is it true/reliable that Impala outperforms Spark so much when executing complex queries?

Best, Flavio
Re: Is There Any Benchmarks Comparing C++ MPI with Spark
I guess you have to understand the differences in architecture. I don't know much about C++ MPI, but it is basically MPI, whereas Spark is inspired by Hadoop MapReduce and optimised for reading/writing large amounts of data with a smart caching and locality strategy. Intuitively, if you have a high CPU-to-message ratio, then MPI might be better. But what that ratio is, is hard to say, and in the end it will depend on your specific application. Finally, in real life, this difference in performance due to the architecture may not be the only or the most important factor in the choice, as Michael already explained.

Bertrand

On Mon, Jun 16, 2014 at 1:23 PM, Michael Cutler mich...@tumra.com wrote:

Hello Wei,

I talk from experience of writing many HPC distributed applications using Open MPI (C/C++) on x86, PowerPC and Cell B.E. processors, and Parallel Virtual Machine (PVM) way before that, back in the 90's. I can say with absolute certainty: *any gains you believe there are because C++ is faster than Java/Scala will be completely blown by the inordinate amount of time you spend debugging your code and/or reinventing the wheel to do even basic tasks like linear regression.*

There are undoubtedly some very specialised use-cases where MPI and its brethren still dominate for High Performance Computing tasks -- like, for example, the nuclear decay simulations run by the US Department of Energy on supercomputers, where they've invested billions solving that use case.

Spark is part of the wider Big Data ecosystem, and its biggest advantages are traction amongst internet-scale companies, hundreds of developers contributing to it, and a community of thousands using it. Need a distributed, fault-tolerant file system? Use HDFS. Need a distributed, fault-tolerant message queue? Use Kafka. Need to coordinate between your worker processes? Use ZooKeeper. Need to run it on a flexible grid of computing resources and handle failures? Run it on Mesos!

The barrier to entry to get going with Spark is very low: download the latest distribution and start the Spark shell. Language bindings for Scala / Java / Python are excellent, meaning you spend less time writing boilerplate code and more time solving problems. Even if you believe you *need* to use native code to do something specific, like fetching HD video frames from satellite video capture cards -- wrap it in a small native library and use the Java Native Access interface to call it from your Java/Scala code.

Have fun, and if you get stuck we're here to help!

MC

On 16 June 2014 08:17, Wei Da xwd0...@gmail.com wrote:

Hi guys,
We are making choices between C++ MPI and Spark. Is there any official comparison between them? Thanks a lot!

Wei
Re: Hadoop 2.3 Centralized Cache vs RDD
http://hadoop.apache.org/docs/r2.3.0/hadoop-project-dist/hadoop-hdfs/CentralizedCacheManagement.html

> We do not currently cache blocks which are under construction, corrupt, or otherwise incomplete.

Have you tried with a file with more than 1 block? And dfs.namenode.path.based.cache.refresh.interval.ms might be too large? You might want to ask on a broader mailing list; this is not related to Spark.

Bertrand

On Fri, May 16, 2014 at 2:56 AM, hequn cheng chenghe...@gmail.com wrote:

I tried centralized cache step by step following the official Apache Hadoop website, but it seems centralized cache doesn't work. See: http://stackoverflow.com/questions/22293358/centralized-cache-failed-in-hadoop-2-3 . Can anyone succeed?

2014-05-15 5:30 GMT+08:00 William Kang weliam.cl...@gmail.com:

Hi, any comments or thoughts on the implications of the newly released centralized cache feature from Hadoop 2.3? How different is it from RDD? Many thanks.

Cao
Re: Real world
http://spark-summit.org ?

Bertrand

On Thu, May 8, 2014 at 2:05 AM, Ian Ferreira ianferre...@hotmail.com wrote:

Folks, I keep getting questioned on real-world experience of Spark, as in mission-critical production deployments. Does anyone have some war stories to share or know of resources to review?

Cheers
- Ian
Re: PySpark still reading only text?
Cool, thanks for the link.

Bertrand Dechoux

On Mon, Apr 21, 2014 at 7:31 PM, Nick Pentreath nick.pentre...@gmail.com wrote:

Also see: https://github.com/apache/spark/pull/455

This will add support for reading SequenceFiles and other InputFormats in PySpark, as long as the Writables are either simple (primitives, maps and arrays of same) or reasonably simple Java objects. I'm about to push a change from MsgPack to Pyrolite for the serialization. Support for saving as SequenceFile or InputFormat could then also come after that. It would be based on saving the picklable Python objects as SequenceFiles and being able to read those back.

— Sent from Mailbox https://www.dropbox.com/mailbox for iPhone

On Thu, Apr 17, 2014 at 11:40 AM, Bertrand Dechoux decho...@gmail.com wrote:

According to the Spark SQL documentation, indeed, this project allows Python to be used while reading/writing tables, i.e. data which is not necessarily in text format. Thanks a lot!

Bertrand Dechoux

On Thu, Apr 17, 2014 at 10:06 AM, Bertrand Dechoux decho...@gmail.com wrote:

Thanks for the JIRA reference. I really need to look at Spark SQL. Am I right to understand that thanks to Spark SQL, Hive data can be read (and it does not need to be a text format), and then 'classical' Spark can work on this extraction? It seems logical, but I haven't worked with Spark SQL as of now. Does it also imply the reverse is true? That I can write data as Hive data with Spark SQL using results from a random (Python) Spark application?

Bertrand Dechoux

On Thu, Apr 17, 2014 at 7:23 AM, Matei Zaharia matei.zaha...@gmail.com wrote:

Yes, this JIRA would enable that. The Hive support also handles HDFS.

Matei

On Apr 16, 2014, at 9:55 PM, Jesvin Jose frank.einst...@gmail.com wrote:

When this is implemented, can you load/save an RDD of pickled objects to HDFS?

On Thu, Apr 17, 2014 at 1:51 AM, Matei Zaharia matei.zaha...@gmail.com wrote:

Hi Bertrand,

We should probably add a SparkContext.pickleFile and RDD.saveAsPickleFile that will allow saving pickled objects. Unfortunately this is not in yet, but there is an issue up to track it: https://issues.apache.org/jira/browse/SPARK-1161. In 1.0, one feature we do have now is the ability to load binary data from Hive using Spark SQL's Python API. Later we will also be able to save to Hive.

Matei

On Apr 16, 2014, at 4:27 AM, Bertrand Dechoux decho...@gmail.com wrote:

Hi,

I have browsed the online documentation and it is stated that PySpark only reads text files as sources. Is it still the case? From what I understand, the RDD can after this first step be any serialized Python structure, if the class definitions are well distributed. Is it not possible to read back those RDDs? That is, to create a flow which parses everything and then, e.g. the next week, starts from the binary, structured data? Technically, what is the difficulty? I would assume the code reading a binary Python RDD and the code reading a binary Python file to be quite similar. Where can I learn more about this subject?

Thanks in advance

Bertrand

--
We dont beat the reaper by living longer. We beat the reaper by living well and living fully. The reaper will come for all of us. Question is, what do we do between the time we are born and the time he shows up? -Randy Pausch
PySpark still reading only text?
Hi,

I have browsed the online documentation and it is stated that PySpark only reads text files as sources. Is it still the case? From what I understand, the RDD can after this first step be any serialized Python structure, if the class definitions are well distributed. Is it not possible to read back those RDDs? That is, to create a flow which parses everything and then, e.g. the next week, starts from the binary, structured data? Technically, what is the difficulty? I would assume the code reading a binary Python RDD and the code reading a binary Python file to be quite similar. Where can I learn more about this subject?

Thanks in advance

Bertrand
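The save/read-back flow this thread asks about can be sketched with the standard-library pickle module alone: each "partition" of records becomes one file of pickled objects. This is plain Python, not the (then non-existent) RDD.saveAsPickleFile API; the file-naming scheme and helper names are illustrative assumptions.

```python
import os
import pickle
import tempfile

def save_partitions(path, partitions):
    # One file per "partition", each holding a pickled list of records.
    os.makedirs(path, exist_ok=True)
    for i, records in enumerate(partitions):
        with open(os.path.join(path, f"part-{i:05d}.pkl"), "wb") as f:
            pickle.dump(list(records), f)

def load_partitions(path):
    # Read the parts back in order and stream out the records.
    for name in sorted(os.listdir(path)):
        with open(os.path.join(path, name), "rb") as f:
            yield from pickle.load(f)

# Two "partitions" of arbitrary structured records, no text parsing involved.
data = [[{"user": 1, "score": 0.5}], [{"user": 2, "score": 0.9}]]
out = os.path.join(tempfile.mkdtemp(), "rdd")
save_partitions(out, data)
print(list(load_partitions(out)))
```

As the question notes, reading such binary files back is not technically harder than reading text; the records come back as live Python objects as long as their class definitions are importable on the reading side.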
Re: Hadoop Input Format - newAPIHadoopFile
I don't know about the Spark side, but the Hadoop context is clear:

old API -> org.apache.hadoop.mapred
new API -> org.apache.hadoop.mapreduce

You might only need to change your import: newAPIHadoopFile expects the new API, whose TextInputFormat lives in org.apache.hadoop.mapreduce.lib.input.

Regards, Bertrand

On Wed, Mar 19, 2014 at 11:29 AM, Pariksheet Barapatre pbarapa...@gmail.com wrote:

Hi,

Trying to read an HDFS file with TextInputFormat:

scala> import org.apache.hadoop.mapred.TextInputFormat
scala> import org.apache.hadoop.io.{LongWritable, Text}
scala> val file2 = sc.newAPIHadoopFile[LongWritable,Text,TextInputFormat]("hdfs://192.168.100.130:8020/user/hue/pig/examples/data/sonnets.txt")

This is giving me the error:

console:14: error: type arguments [org.apache.hadoop.io.LongWritable,org.apache.hadoop.io.Text,org.apache.hadoop.mapred.TextInputFormat] conform to the bounds of none of the overloaded alternatives of value newAPIHadoopFile: [K, V, F <: org.apache.hadoop.mapreduce.InputFormat[K,V]](path: String, fClass: Class[F], kClass: Class[K], vClass: Class[V], conf: org.apache.hadoop.conf.Configuration)org.apache.spark.rdd.RDD[(K, V)] and [K, V, F <: org.apache.hadoop.mapreduce.InputFormat[K,V]](path: String)(implicit km: scala.reflect.ClassTag[K], implicit vm: scala.reflect.ClassTag[V], implicit fm: scala.reflect.ClassTag[F])org.apache.spark.rdd.RDD[(K, V)]

What is the correct syntax if I want to use TextInputFormat? Also, how to use a custom InputFormat? Very silly question, but I am not sure how and where to keep the jar file containing the custom InputFormat class.

Thanks
Pariksheet

--
Cheers,
Pari