Spark assembly for YARN/CDH5
Does anyone know if there are Spark assemblies created and available for download that have been built for CDH5 and YARN? Thanks, Philip
Re: creating a distributed index
After playing around with mapPartitions I think this does exactly what I want. I can pass in a function to mapPartitions that looks like this:

  def f1(iter: Iterator[String]): Iterator[MyIndex] = {
    val idx: MyIndex = new MyIndex()
    while (iter.hasNext) {
      val text: String = iter.next
      idx.add(text)
    }
    List(idx).iterator
  }

  val indexRDD: RDD[MyIndex] = rdd.mapPartitions(f1, true)

Now I can perform a match operation like so:

  val matchRdd: RDD[Match] = indexRDD.flatMap(index => index.`match`(myquery))

I'm sure it won't take much imagination to figure out how to do the matching in a batch way. If anyone has done anything along these lines I'd love to have some feedback. Thanks, Philip

On 08/04/2014 09:46 AM, Philip Ogren wrote: This looks like a really cool feature and it seems likely that this will be extremely useful for things we are doing. However, I'm not sure it is quite what I need here. With an inverted index you don't actually look items up by their keys but instead try to match against some input string. So, if I created an inverted index on 1M strings/documents in a single JVM I could subsequently submit a string-valued query to retrieve the n-best matches. I'd like to do something like build ten 1M string/document indexes across ten nodes, submit a string-valued query to each of the ten indexes, and aggregate the n-best matches from the ten sets of results. Would this be possible with IndexedRDD or some other feature of Spark? Thanks, Philip

On 08/01/2014 04:26 PM, Ankur Dave wrote: At 2014-08-01 14:50:22 -0600, Philip Ogren philip.og...@oracle.com wrote: It seems that I could do this with mapPartitions so that each element in a partition gets added to an index for that partition. [...] Would it then be possible to take a string and query each partition's index with it? Or better yet, take a batch of strings and query each string in the batch against each partition's index? I proposed a key-value store based on RDDs called IndexedRDD that does exactly what you described. It uses mapPartitions to construct an index within each partition, then exposes get and multiget methods to allow looking up values associated with given keys. It will hopefully make it into Spark 1.2.0. Until then you can try it out by merging in the pull request locally: https://github.com/apache/spark/pull/1297. See JIRA for details and slides on how it works: https://issues.apache.org/jira/browse/SPARK-2365. Ankur
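A sketch of what the batch version might look like (MyIndex and Match are the hypothetical types from the message above; this additionally assumes Match records the query string it matched and a numeric score, which the original code doesn't show):

  import org.apache.spark.SparkContext._  // pair-RDD implicits
  import org.apache.spark.rdd.RDD

  // Broadcast the whole query batch, match every query against every
  // partition's index, then keep the n best matches per query.
  def batchMatch(indexRDD: RDD[MyIndex], queries: Seq[String], n: Int): Map[String, Seq[Match]] = {
    val broadcastQueries = indexRDD.context.broadcast(queries)
    indexRDD
      .flatMap(index => broadcastQueries.value.flatMap(q => index.`match`(q)))
      .groupBy(_.query)                              // regroup matches by query string
      .mapValues(_.toSeq.sortBy(-_.score).take(n))   // n-best across all partition indexes
      .collect()
      .toMap
  }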
relationship of RDD[Array[String]] to Array[Array[String]]
It is really nice that Spark RDDs provide functions that are often equivalent to functions found in Scala collections. For example, I can call: myArray.map(myFx) and equivalently myRdd.map(myFx) Awesome! My question is this: is it possible to write code that works on either an RDD or a local collection without having to have parallel implementations? I can't tell that RDD or Array share any supertypes or traits by looking at the respective scaladocs. Perhaps implicit conversions could be used here. What I would like to do is have a single function whose body is like this: myData.map(myFx) where myData could be an RDD[Array[String]] (for example) or an Array[Array[String]]. Has anyone had success doing this? Thanks, Philip
Re: relationship of RDD[Array[String]] to Array[Array[String]]
Thanks Michael, That is one solution that I had thought of. It seems like a bit of overkill for the few methods I want to do this for - but I will think about it. I guess I was hoping that I was missing something more obvious/easier. Philip On 07/21/2014 11:20 AM, Michael Malak wrote: It's really more of a Scala question than a Spark question, but the standard OO (not Scala-specific) way is to create your own custom supertype (e.g. MyCollectionTrait), inherited/implemented by two concrete classes (e.g. MyRDD and MyArray), each of which manually forwards method calls to the corresponding pre-existing library implementations. Writing all those forwarding method calls is tedious, but Scala provides at least one bit of syntactic sugar, which alleviates having to type in twice the parameter lists for each method: http://stackoverflow.com/questions/8230831/is-method-parameter-forwarding-possible-in-scala I'm not seeing a way to utilize implicit conversions in this case. Since Scala is statically (albeit inferred) typed, I don't see a way around having a common supertype. On Monday, July 21, 2014 11:01 AM, Philip Ogren philip.og...@oracle.com wrote: It is really nice that Spark RDDs provide functions that are often equivalent to functions found in Scala collections. For example, I can call: myArray.map(myFx) and equivalently myRdd.map(myFx) Awesome! My question is this: is it possible to write code that works on either an RDD or a local collection without having to have parallel implementations? I can't tell that RDD or Array share any supertypes or traits by looking at the respective scaladocs. Perhaps implicit conversions could be used here. What I would like to do is have a single function whose body is like this: myData.map(myFx) where myData could be an RDD[Array[String]] (for example) or an Array[Array[String]]. Has anyone had success doing this? Thanks, Philip
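To make the forwarding approach concrete, here is a minimal sketch; the names are invented for illustration, only map is forwarded (real code would repeat the pattern for each method needed), and it compiles against the Spark 1.x API, where map requires a ClassTag:

  import scala.reflect.ClassTag
  import org.apache.spark.rdd.RDD

  // A hypothetical common supertype for RDDs and local arrays.
  trait MyCollection[T] {
    def map[U: ClassTag](f: T => U): MyCollection[U]
  }

  class MyRDD[T](rdd: RDD[T]) extends MyCollection[T] {
    def map[U: ClassTag](f: T => U): MyCollection[U] = new MyRDD(rdd.map(f))
  }

  class MyArray[T](array: Array[T]) extends MyCollection[T] {
    def map[U: ClassTag](f: T => U): MyCollection[U] = new MyArray(array.map(f))
  }

A function like def process(data: MyCollection[Array[String]]) = data.map(_.length) then works on either backing collection.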
Re: Announcing Spark 1.0.1
Hi Patrick, This is great news, but I nearly missed the announcement because it had scrolled off the folder view that I have Spark users list messages go to - 40+ new threads since you sent the email out on Friday evening. You might consider having someone on your team create a spark-announcement list so that it is easier to disseminate important information like this release announcement. Thanks again for all your hard work. I know you and the rest of the team are getting a million requests a day. Philip On 07/11/2014 07:35 PM, Patrick Wendell wrote: I am happy to announce the availability of Spark 1.0.1! This release includes contributions from 70 developers. Spark 1.0.1 includes fixes across several areas of Spark, including the core API, PySpark, and MLlib. It also includes new features in Spark's (alpha) SQL library, including support for JSON data and performance and stability fixes. Visit the release notes [1] to read about this release or download [2] the release today. [1] http://spark.apache.org/releases/spark-release-1-0-1.html [2] http://spark.apache.org/downloads.html
Multiple SparkContexts with different configurations in same JVM
In various previous versions of Spark (and I believe the current version, 1.0.0, as well) we have noticed that it does not seem possible to have, in the same JVM, both a local SparkContext and a SparkContext connected to a cluster, whether a standalone Spark cluster (i.e. using Spark's own resource manager) or a YARN cluster. Is this a known issue? If not, then I would be happy to write up a bug report on what the bad/unexpected behavior is and how to reproduce it. If so, then are there plans to fix this? Perhaps there is a Jira issue I could be pointed to or other related discussion. Thanks, Philip
Re: Unit test failure: Address already in use
In my unit tests I have a base class that all my tests extend that has a setup and teardown method that they inherit. They look something like this:

  var spark: SparkContext = _

  @Before
  def setUp() {
    Thread.sleep(100L) // this seems to give spark more time to reset from the previous test's tearDown
    spark = new SparkContext("local", "test spark")
  }

  @After
  def tearDown() {
    spark.stop
    spark = null // not sure why this helps but it does!
    System.clearProperty("spark.master.port")
  }

It's been since last fall (i.e. version 0.8.x) since I've examined this code, so I can't vouch that it is still accurate/necessary - but it still works for me.

On 06/18/2014 12:59 PM, Lisonbee, Todd wrote: Disabling parallelExecution has worked for me. Other alternatives I've tried that also work include:

1. Using a lock - this will let tests execute in parallel except for those using a SparkContext. If you have a large number of tests that could execute in parallel, this can shave off some time.

  object TestingSparkContext {
    val lock = new Lock()
  }

  // before you instantiate your local SparkContext
  TestingSparkContext.lock.acquire()
  // after you call sc.stop()
  TestingSparkContext.lock.release()

2. Sharing a local SparkContext between tests. - This is nice because your tests will run faster. Start-up and shutdown is time consuming (can add a few seconds per test). - The downside is that your tests are using the same SparkContext so they are less independent of each other. I haven't seen issues with this yet but there are likely some things that might crop up. Best, Todd

*From:* Anselme Vignon [mailto:anselme.vig...@flaminem.com] *Sent:* Wednesday, June 18, 2014 12:33 AM *To:* user@spark.apache.org *Subject:* Re: Unit test failure: Address already in use

Hi, Could your problem come from the fact that you run your tests in parallel? If you are running Spark in local mode, you cannot have concurrent Spark instances running. This means that your tests instantiating a SparkContext cannot be run in parallel. The easiest fix is to tell sbt to not run parallel tests. This can be done by adding the following line in your build.sbt: parallelExecution in Test := false Cheers, Anselme

2014-06-17 23:01 GMT+02:00 SK skrishna...@gmail.com: Hi, I have 3 unit tests (independent of each other) in the /src/test/scala folder. When I run each of them individually using sbt test-only, all 3 pass the test. But when I run them all using sbt test, then they fail with the warning below. I am wondering if the binding exception results in failure to run the job, thereby causing the failure. If so, what can I do to address this binding exception? I am running these tests locally on a standalone machine (i.e. SparkContext("local", "test")).

14/06/17 13:42:48 WARN component.AbstractLifeCycle: FAILED org.eclipse.jetty.server.Server@3487b78d: java.net.BindException: Address already in use
java.net.BindException: Address already in use
at sun.nio.ch.Net.bind0(Native Method)
at sun.nio.ch.Net.bind(Net.java:174)
at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:139)
at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:77)

thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Unit-test-failure-Address-already-in-use-tp7771.html
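For reference, a minimal sketch of Todd's option 2 (sharing one context; the names are invented, and it assumes no test stops the shared context):

  import org.apache.spark.SparkContext

  // One SparkContext for the whole test run, stopped at JVM exit
  // instead of per test.
  object SharedSparkContext {
    lazy val sc: SparkContext = {
      val context = new SparkContext("local", "shared test context")
      sys.addShutdownHook(context.stop())
      context
    }
  }

Tests then refer to SharedSparkContext.sc instead of creating their own context.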
Re: Processing audio/video/images
I asked a question related to Marcelo's answer a few months ago. The discussion there may be useful: http://apache-spark-user-list.1001560.n3.nabble.com/RDD-URI-td1054.html On 06/02/2014 06:09 PM, Marcelo Vanzin wrote: Hi Jamal, If what you want is to process lots of files in parallel, the best approach is probably to load all file names into an array and parallelize that. Then each task will take a path as input and can process it however it wants. Or you could write the file list to a file, and then use sc.textFile() to open it (assuming one path per line), and the rest is pretty much the same as above. It will probably be hard to process each individual file in parallel, unless mp3 and jpg files can be split into multiple blocks that can be processed separately. In that case, you'd need a custom (Hadoop) input format that is able to calculate the splits. But it doesn't sound like that's what you want. On Mon, Jun 2, 2014 at 5:02 PM, jamal sasha jamalsha...@gmail.com wrote: Hi, How does one process data sources other than text? Let's say I have millions of mp3 (or jpeg) files and I want to use spark to process them. How does one go about it? I have never been able to figure this out. Let's say I have this library in python which works like the following:

  import audio
  song = audio.read_mp3(filename)

Then most of the methods are attached to song, or maybe there is another function which takes the song type as an input. Maybe the above is just rambling... but how do I use spark to process (say) audio files? Thanks
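Marcelo's first suggestion might look roughly like this in Scala (a sketch; the paths must be readable from every worker, e.g. on a shared mount, and the byte-count map is a stand-in for real audio/image processing):

  val paths = Seq("/data/a.mp3", "/data/b.mp3")    // hypothetical file listing
  val results = sc.parallelize(paths, paths.size)  // one partition per file
    .map { path =>
      val bytes = java.nio.file.Files.readAllBytes(java.nio.file.Paths.get(path))
      (path, bytes.length)  // replace with whatever per-file work you need
    }
    .collect()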
Re: Use SparkListener to get overall progress of an action
Hi Pierre, I asked a similar question on this list about 6 weeks ago. Here is one answer I got that is of particular note: http://mail-archives.apache.org/mod_mbox/spark-user/201404.mbox/%3ccamjob8n3foaxd-dc5j57-n1oocwxefcg5chljwnut7qnreq...@mail.gmail.com%3E In the upcoming release of Spark 1.0 there will be a feature that provides for exactly what you describe: capturing the information displayed on the UI in JSON. More details will be provided in the documentation, but for now, anything before 0.9.1 can only go through JobLogger.scala, which outputs information in a somewhat arbitrary format and will be deprecated soon. If you find this feature useful, you can test it out by building the master branch of Spark yourself, following the instructions in https://github.com/apache/spark/pull/42. On 05/22/2014 08:51 AM, Pierre B wrote: Is there a simple way to monitor the overall progress of an action using SparkListener or anything else? I see that one can name an RDD... Could that be used to determine which action triggered a stage? Thanks Pierre -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Use-SparkListener-to-get-overall-progress-of-an-action-tp6256.html
Re: Spark unit testing best practices
Have you actually found this to be true? I have found Spark local mode to be quite good about blowing up if there is something non-serializable, and so my unit tests have been great for detecting this. I have never seen something that worked in local mode that didn't work on the cluster because of different serialization requirements between the two. Perhaps it is different when using Kryo. On 05/14/2014 04:34 AM, Andras Nemeth wrote: E.g. if I accidentally use a closure which has something non-serializable in it, then my test will happily succeed in local mode but go down in flames on a real cluster.
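For what it's worth, a check like the following fails even in local mode in recent Spark versions, because Spark serializes the closure (which captures helper) before running it (a sketch; the class and variable names are invented):

  import org.apache.spark.SparkContext

  class NotSerializableHelper { def offset: Int = 42 }

  val sc = new SparkContext("local", "serialization check")
  val helper = new NotSerializableHelper
  // Throws a "task not serializable" error even with a local master.
  sc.parallelize(1 to 10).map(_ + helper.offset).collect()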
Re: Opinions stratosphere
Great reference! I just skimmed through the results without reading much of the methodology - but it looks like Spark outperforms Stratosphere fairly consistently in the experiments. It's too bad the data sources only range from 2GB to 8GB. Who knows if the apparent pattern would extend out to 64GB, 128GB, 1TB, and so on... On 05/01/2014 06:02 PM, Christopher Nguyen wrote: Someone (Ze Ni, https://www.sics.se/people/ze-ni) has actually attempted such a comparative study as a Masters thesis: http://www.diva-portal.org/smash/get/diva2:605106/FULLTEXT01.pdf According to this snapshot (c. 2013), Stratosphere is different from Spark in not having an explicit concept of an in-memory dataset (e.g., RDD). In principle this could be argued to be an implementation detail; the operators and execution plan/data flow are of primary concern in the API, and the data representation/materializations are otherwise unspecified. But in practice, for long-running interactive applications, I consider RDDs to be of fundamental, first-class citizen importance, and the key distinguishing feature of Spark's model vs other in-memory approaches that treat memory merely as an implicit cache. -- Christopher T. Nguyen Co-founder CEO, Adatao http://adatao.com linkedin.com/in/ctnguyen On Tue, Nov 26, 2013 at 1:26 PM, Matei Zaharia matei.zaha...@gmail.com wrote: I don't know a lot about it except from the research side, where the team has done interesting optimization stuff for these types of applications. In terms of the engine, one thing I'm not sure of is whether Stratosphere allows explicit caching of datasets (similar to RDD.cache()) and interactive queries (similar to spark-shell). But it's definitely an interesting project to watch. Matei On Nov 22, 2013, at 4:17 PM, Ankur Chauhan achau...@brightcove.com wrote: Hi, That's what I thought, but as per the slides on http://www.stratosphere.eu they seem to know about spark and the scala api does look similar. I found the PACT model interesting. Would like to know if matei or other core committers have something to weigh in on. -- Ankur On 22 Nov 2013, at 16:05, Patrick Wendell pwend...@gmail.com wrote: I've never seen that project before, would be interesting to get a comparison. Seems to offer a much lower level API. For instance this is a wordcount program: https://github.com/stratosphere/stratosphere/blob/master/pact/pact-examples/src/main/java/eu/stratosphere/pact/example/wordcount/WordCount.java On Thu, Nov 21, 2013 at 3:15 PM, Ankur Chauhan achau...@brightcove.com wrote: Hi, I was just curious about https://github.com/stratosphere/stratosphere and how spark compares to it. Anyone have any experience with it to make any comments? -- Ankur
RDD.tail()
Has there been any thought to adding a tail() method to RDD? It would be really handy to skip over the first item in an RDD when it contains header information. Even better would be a drop(int) function that would allow you to skip over several lines of header information. Our attempts to do something equivalent with a filter() call seem a bit contorted. Any thoughts? Thanks, Philip
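In the meantime, drop(n) can be approximated with mapPartitionsWithIndex, which is less contorted than a filter (a sketch; it assumes the header lines all land in the first partition, which holds for a freshly loaded text file):

  import org.apache.spark.rdd.RDD

  // Skip the first n lines of the RDD.
  def dropHeader(rdd: RDD[String], n: Int): RDD[String] =
    rdd.mapPartitionsWithIndex { (partition, lines) =>
      if (partition == 0) lines.drop(n) else lines
    }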
Re: Is there a way to get the current progress of the job?
This is great news; thanks for the update! I will either wait for the 1.0 release or go and test it ahead of time from git, rather than trying to pull it out of JobLogger or creating my own SparkListener. On 04/02/2014 06:48 PM, Andrew Or wrote: Hi Philip, In the upcoming release of Spark 1.0 there will be a feature that provides for exactly what you describe: capturing the information displayed on the UI in JSON. More details will be provided in the documentation, but for now, anything before 0.9.1 can only go through JobLogger.scala, which outputs information in a somewhat arbitrary format and will be deprecated soon. If you find this feature useful, you can test it out by building the master branch of Spark yourself, following the instructions in https://github.com/apache/spark/pull/42. Andrew On Wed, Apr 2, 2014 at 3:39 PM, Philip Ogren philip.og...@oracle.com wrote: What I'd like is a way to capture the information provided on the stages page (i.e. cluster:4040/stages via IndexPage). Looking through the Spark code, it doesn't seem like it is possible to directly query for specific facts such as how many tasks have succeeded or how many total tasks there are for a given active stage. Instead, it looks like all the data for the page is generated at once using information from the JobProgressListener. It doesn't seem like I have any way to programmatically access this information myself. I can't even instantiate my own JobProgressListener because it is spark package private. I could implement my own SparkListener and gather up the information myself. It feels a bit awkward since classes like Task and TaskInfo are also spark package private. It does seem possible to gather up what I need, but it seems like this sort of information should just be available without implementing a custom SparkListener (or worse, screen scraping the html generated by StageTable!) I was hoping that I would find the answer in MetricsServlet, which is turned on by default. It seems that when I visit http://cluster:4040/metrics/json/ I should be able to get everything I want, but I don't see the basic stage/task progress information I would expect. Are there special metrics properties that I should set to get this info? I think this would be the best solution - just give it the right URL and parse the resulting JSON - but I can't seem to figure out how to do this or if it is possible. Any advice is appreciated. Thanks, Philip On 04/01/2014 09:43 AM, Philip Ogren wrote: Hi DB, Just wondering if you ever got an answer to your question about monitoring progress - either offline or through your own investigation. Any findings would be appreciated. Thanks, Philip On 01/30/2014 10:32 PM, DB Tsai wrote: Hi guys, When we're running a very long job, we would like to show users the current progress of the map and reduce job. After looking at the api document, I don't find anything for this. However, in Spark UI, I could see the progress of the task. Is there anything I missed? Thanks. Sincerely, DB Tsai Machine Learning Engineer Alpine Data Labs -- Web: http://alpinenow.com/
Re: Is there a way to get the current progress of the job?
I can appreciate the reluctance to expose something like the JobProgressListener as a public interface. It's exactly the sort of thing that you want to deprecate as soon as something better comes along, and it can be a real pain when trying to maintain the level of backwards compatibility that we all expect from commercial grade software. Instead of simply marking it private and therefore unavailable to Spark developers, it might be worth incorporating something like a @Beta annotation http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/annotations/Beta.html which you could sprinkle liberally throughout Spark, and which communicates "hey, use this if you want to, because it's here now, but don't come crying if we rip it out or change it later." This might be better than simply marking so many useful functions/classes as private. I bet such an annotation could generate a compile warning/error for those who don't want to risk using them. On 04/02/2014 06:40 PM, Patrick Wendell wrote: Hey Philip, Right now there is no mechanism for this. You have to go in through the low level listener interface. We could consider exposing the JobProgressListener directly - I think it's been factored nicely so it's fairly decoupled from the UI. The concern is this is a semi-internal piece of functionality and something we might, e.g., want to change the API of over time. - Patrick On Wed, Apr 2, 2014 at 3:39 PM, Philip Ogren philip.og...@oracle.com wrote: What I'd like is a way to capture the information provided on the stages page (i.e. cluster:4040/stages via IndexPage). Looking through the Spark code, it doesn't seem like it is possible to directly query for specific facts such as how many tasks have succeeded or how many total tasks there are for a given active stage. Instead, it looks like all the data for the page is generated at once using information from the JobProgressListener. It doesn't seem like I have any way to programmatically access this information myself. I can't even instantiate my own JobProgressListener because it is spark package private. I could implement my own SparkListener and gather up the information myself. It feels a bit awkward since classes like Task and TaskInfo are also spark package private. It does seem possible to gather up what I need, but it seems like this sort of information should just be available without implementing a custom SparkListener (or worse, screen scraping the html generated by StageTable!) I was hoping that I would find the answer in MetricsServlet, which is turned on by default. It seems that when I visit http://cluster:4040/metrics/json/ I should be able to get everything I want, but I don't see the basic stage/task progress information I would expect. Are there special metrics properties that I should set to get this info? I think this would be the best solution - just give it the right URL and parse the resulting JSON - but I can't seem to figure out how to do this or if it is possible. Any advice is appreciated. Thanks, Philip On 04/01/2014 09:43 AM, Philip Ogren wrote: Hi DB, Just wondering if you ever got an answer to your question about monitoring progress - either offline or through your own investigation. Any findings would be appreciated. Thanks, Philip On 01/30/2014 10:32 PM, DB Tsai wrote: Hi guys, When we're running a very long job, we would like to show users the current progress of the map and reduce job. After looking at the api document, I don't find anything for this.
However, in Spark UI, I could see the progress of the task. Is there anything I missed? Thanks. Sincerely, DB Tsai Machine Learning Engineer Alpine Data Labs -- Web: http://alpinenow.com/
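As for the @Beta idea: in Scala the annotation itself is trivial to define; making the compiler actually warn on use would take a compiler plugin, so this sketch only documents intent (the names are invented):

  import scala.annotation.StaticAnnotation

  /** Marks an API that may change or disappear in any release. */
  class Beta extends StaticAnnotation

  @Beta
  class SomeExperimentalListener  // hypothetical class exposed under the annotation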
Re: Is there a way to get the current progress of the job?
What I'd like is a way to capture the information provided on the stages page (i.e. cluster:4040/stages via IndexPage). Looking through the Spark code, it doesn't seem like it is possible to directly query for specific facts such as how many tasks have succeeded or how many total tasks there are for a given active stage. Instead, it looks like all the data for the page is generated at once using information from the JobProgressListener. It doesn't seem like I have any way to programmatically access this information myself. I can't even instantiate my own JobProgressListener because it is spark package private. I could implement my own SparkListener and gather up the information myself. It feels a bit awkward since classes like Task and TaskInfo are also spark package private. It does seem possible to gather up what I need, but it seems like this sort of information should just be available without implementing a custom SparkListener (or worse, screen scraping the html generated by StageTable!) I was hoping that I would find the answer in MetricsServlet, which is turned on by default. It seems that when I visit http://cluster:4040/metrics/json/ I should be able to get everything I want, but I don't see the basic stage/task progress information I would expect. Are there special metrics properties that I should set to get this info? I think this would be the best solution - just give it the right URL and parse the resulting JSON - but I can't seem to figure out how to do this or if it is possible. Any advice is appreciated. Thanks, Philip On 04/01/2014 09:43 AM, Philip Ogren wrote: Hi DB, Just wondering if you ever got an answer to your question about monitoring progress - either offline or through your own investigation. Any findings would be appreciated. Thanks, Philip On 01/30/2014 10:32 PM, DB Tsai wrote: Hi guys, When we're running a very long job, we would like to show users the current progress of the map and reduce job. After looking at the api document, I don't find anything for this. However, in Spark UI, I could see the progress of the task. Is there anything I missed? Thanks. Sincerely, DB Tsai Machine Learning Engineer Alpine Data Labs -- Web: http://alpinenow.com/
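A rough sketch of the custom SparkListener route (written against a later, public listener API; the field names shifted between the versions discussed here, and the counters are atomic because callbacks arrive on a listener thread):

  import java.util.concurrent.atomic.AtomicInteger
  import org.apache.spark.Success
  import org.apache.spark.scheduler.{SparkListener, SparkListenerStageSubmitted, SparkListenerTaskEnd}

  class ProgressListener extends SparkListener {
    val totalTasks = new AtomicInteger(0)
    val succeededTasks = new AtomicInteger(0)

    override def onStageSubmitted(stageSubmitted: SparkListenerStageSubmitted): Unit =
      totalTasks.addAndGet(stageSubmitted.stageInfo.numTasks)

    override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit =
      if (taskEnd.reason == Success) succeededTasks.incrementAndGet()
  }

The listener would be registered with sc.addSparkListener(new ProgressListener).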
RDD[URI]
In my Spark programming thus far my unit of work has been a single row from an HDFS file, by creating an RDD[Array[String]] with something like: spark.textFile(path).map(_.split("\t")) Now, I'd like to do some work over a large collection of files in which the unit of work is a single file (rather than a row from a file.) Does Spark anticipate users creating an RDD[URI] or RDD[File] or some such and supporting actions and transformations that one might want to do on such an RDD? Any advice and/or code snippets would be appreciated! Thanks, Philip
Re: RDD[URI]
Thank you for the links! These look very useful. I do not have a precise use case - at this point I'm just exploring what is possible/feasible. Like the blog suggests, I might have a bunch of images lying around and might want to collect meta-data from them. In my case, I do a lot of NLP and so I would like to process text from a large collection of documents, perhaps after running them through Tika. Both of these use cases seem closely related from a Spark user's perspective. On 1/30/2014 11:02 AM, Nick Pentreath wrote: What is the precise use case and reasoning behind wanting to work on a File as the record in an RDD? CombineFileInputFormat may be useful in some way: http://www.idryman.org/blog/2013/09/22/process-small-files-on-hadoop-using-combinefileinputformat-1/ https://svn.apache.org/repos/asf/hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-examples/src/main/java/org/apache/hadoop/examples/MultiFileWordCount.java -- Sent from Mailbox https://www.dropbox.com/mailbox for iPhone On Thu, Jan 30, 2014 at 7:34 PM, Christopher Nguyen c...@adatao.com wrote: Philip, I guess the key problem statement is the large collection of files part? If so this may be helpful, at the HDFS level: http://blog.cloudera.com/blog/2009/02/the-small-files-problem/. Otherwise you can always start with an RDD[fileUri] and go from there to an RDD[(fileUri, read_contents)]. Sent while mobile. Pls excuse typos etc. On Jan 30, 2014 9:13 AM, 尹绪森 yinxu...@gmail.com wrote: I am also interested in this. My solution now is turning each file into a single line of text, i.e. deleting all '\n', then adding the filename as the head of the line followed by a space: [filename] [space] [content] Anyone have better ideas? On 2014-1-31 12:18 AM, Philip Ogren philip.og...@oracle.com wrote: In my Spark programming thus far my unit of work has been a single row from an HDFS file, by creating an RDD[Array[String]] with something like: spark.textFile(path).map(_.split("\t")) Now, I'd like to do some work over a large collection of files in which the unit of work is a single file (rather than a row from a file.) Does Spark anticipate users creating an RDD[URI] or RDD[File] or some such and supporting actions and transformations that one might want to do on such an RDD? Any advice and/or code snippets would be appreciated! Thanks, Philip
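Christopher's RDD[(fileUri, read_contents)] suggestion, spelled out as a sketch (listMyFiles is a hypothetical enumeration step, and the URIs must be resolvable from the workers):

  import scala.io.Source

  val fileUris: Seq[String] = listMyFiles()  // e.g. http:// or file:// URIs on a shared mount
  val uriContents = sc.parallelize(fileUris).map { uri =>
    val source = Source.fromURL(uri)
    try (uri, source.mkString) finally source.close()
  }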
various questions about yarn-standalone vs. yarn-client
I have a few questions about yarn-standalone and yarn-client deployment modes that are described on the Launching Spark on YARN http://spark.incubator.apache.org/docs/latest/running-on-yarn.html page.

1) Can someone give me a basic conceptual overview? I am struggling with understanding the difference between the yarn-standalone and yarn-client deployment modes. I understand that yarn-standalone runs on the name node and that yarn-client can be run from a remote machine - but otherwise I don't understand how they are different. It seems like yarn-client is the obvious better approach because it can run from anywhere - but presumably there is some advantage to yarn-standalone (otherwise, why not just run yarn-client on the name node or from a remote machine?) I'm also curious to know what standalone refers to here.

2) I was able to run SparkPi in yarn-client mode from a simple scala main method by providing only the SPARK_JAR and SPARK_YARN_APP_JAR environment variables and by putting the various *-site.xml files on my classpath. That is, I didn't call run-example - I just called my Scala app directly. We've had trouble duplicating this success on our own app and are in the process of applying the patch detailed here: https://github.com/apache/incubator-spark/pull/371 However, one thing that I think I learned is that Spark doesn't have to be installed on the name node. Is that correct? Should I need to have Spark installed at all, either on my remote machine or on the name node? It would be great if all that was needed were the SPARK_JAR and the SPARK_YARN_APP_JAR.

3) Finally, is it possible to pre-stage the assembly jar files so they don't need to be copied over every time I start a new Spark job in yarn-client mode?

Any advice here is appreciated. Thanks! Philip
Re: Anyone know how to submit a spark job to yarn in java code?
Great question! I was writing up a similar question this morning and decided to investigate some more before sending. Here's what I'm trying. I have created a new scala project that contains only spark-examples-assembly-0.8.1-incubating.jar and spark-assembly-0.8.1-incubating-hadoop2.2.0-cdh5.0.0-beta-1.jar on the classpath, and I am trying to create a yarn-client SparkContext with the following:

  val spark = new SparkContext("yarn-client", "my-app")

My hope is to run this on my laptop and have it execute/connect on the yarn application master. The hope is that if I can get this to work, then I can do the same from a web application. I'm trying to unpack run-example.sh, compute-classpath, SparkPi, and *.yarn.Client to figure out what environment variables I need to set up, etc. I grabbed all the .xml files out of my cluster's conf directory (in my case /etc/hadoop/conf.cloudera.yarn), such as e.g. yarn-site.xml, and put them on my classpath. I also set up the environment variables SPARK_JAR, SPARK_YARN_APP_JAR, SPARK_YARN_USER_ENV, and SPARK_HOME. When I run my simple scala script, I get the following error:

Exception in thread main org.apache.spark.SparkException: Yarn application already ended, might be killed or not able to launch application master.
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApp(YarnClientSchedulerBackend.scala:95)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:72)
at org.apache.spark.scheduler.cluster.ClusterScheduler.start(ClusterScheduler.scala:119)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:273)
at SparkYarnClientExperiment$.main(SparkYarnClientExperiment.scala:14)
at SparkYarnClientExperiment.main(SparkYarnClientExperiment.scala)

I can look at my yarn UI and see that it registers a failed application, so I take this as incremental progress. However, I'm not sure how to troubleshoot what I'm doing from here, or if what I'm trying to do is even sensible/possible. Any advice is appreciated. Thanks, Philip

On 1/15/2014 11:25 AM, John Zhao wrote: Now I am working on a web application and I want to submit a spark job to hadoop yarn. I have already done my own assembly and can run it on the command line with the following script:

export YARN_CONF_DIR=/home/gpadmin/clusterConfDir/yarn
export SPARK_JAR=./assembly/target/scala-2.9.3/spark-assembly-0.8.1-incubating-hadoop2.0.5-alpha.jar
./spark-class org.apache.spark.deploy.yarn.Client --jar ./examples/target/scala-2.9.3/spark-examples-assembly-0.8.1-incubating.jar --class org.apache.spark.examples.SparkPi --args yarn-standalone --num-workers 3 --master-memory 1g --worker-memory 512m --worker-cores 1

It works fine. Then I realized that it is hard to submit the job from a web application. It looks like the spark-assembly-0.8.1-incubating-hadoop2.0.5-alpha.jar or spark-examples-assembly-0.8.1-incubating.jar is a really big jar. I believe it contains everything. So my questions are: 1) When I run the above script, which jar is being submitted to the yarn server? 2) It looks like the spark-assembly-0.8.1-incubating-hadoop2.0.5-alpha.jar plays the role of client side and spark-examples-assembly-0.8.1-incubating.jar goes with the spark runtime and examples which will be running in yarn, am I right? 3) Does anyone have any similar experience? I did lots of hadoop MR stuff and want to follow the same logic to submit a spark job. For now I can only find the command line way to submit a spark job to yarn. I believe there is an easy way to integrate spark in a web application.
Thanks. John.
Re: Anyone know how to submit a spark job to yarn in java code?
My problem seems to be related to this: https://issues.apache.org/jira/browse/MAPREDUCE-4052 So, I will try running my setup from a Linux client and see if I have better luck. On 1/15/2014 11:38 AM, Philip Ogren wrote: Great question! I was writing up a similar question this morning and decided to investigate some more before sending. Here's what I'm trying. I have created a new scala project that contains only spark-examples-assembly-0.8.1-incubating.jar and spark-assembly-0.8.1-incubating-hadoop2.2.0-cdh5.0.0-beta-1.jar on the classpath, and I am trying to create a yarn-client SparkContext with the following:

  val spark = new SparkContext("yarn-client", "my-app")

My hope is to run this on my laptop and have it execute/connect on the yarn application master. The hope is that if I can get this to work, then I can do the same from a web application. I'm trying to unpack run-example.sh, compute-classpath, SparkPi, and *.yarn.Client to figure out what environment variables I need to set up, etc. I grabbed all the .xml files out of my cluster's conf directory (in my case /etc/hadoop/conf.cloudera.yarn), such as e.g. yarn-site.xml, and put them on my classpath. I also set up the environment variables SPARK_JAR, SPARK_YARN_APP_JAR, SPARK_YARN_USER_ENV, and SPARK_HOME. When I run my simple scala script, I get the following error:

Exception in thread main org.apache.spark.SparkException: Yarn application already ended, might be killed or not able to launch application master.
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApp(YarnClientSchedulerBackend.scala:95)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:72)
at org.apache.spark.scheduler.cluster.ClusterScheduler.start(ClusterScheduler.scala:119)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:273)
at SparkYarnClientExperiment$.main(SparkYarnClientExperiment.scala:14)
at SparkYarnClientExperiment.main(SparkYarnClientExperiment.scala)

I can look at my yarn UI and see that it registers a failed application, so I take this as incremental progress. However, I'm not sure how to troubleshoot what I'm doing from here, or if what I'm trying to do is even sensible/possible. Any advice is appreciated. Thanks, Philip

On 1/15/2014 11:25 AM, John Zhao wrote: Now I am working on a web application and I want to submit a spark job to hadoop yarn. I have already done my own assembly and can run it on the command line with the following script:

export YARN_CONF_DIR=/home/gpadmin/clusterConfDir/yarn
export SPARK_JAR=./assembly/target/scala-2.9.3/spark-assembly-0.8.1-incubating-hadoop2.0.5-alpha.jar
./spark-class org.apache.spark.deploy.yarn.Client --jar ./examples/target/scala-2.9.3/spark-examples-assembly-0.8.1-incubating.jar --class org.apache.spark.examples.SparkPi --args yarn-standalone --num-workers 3 --master-memory 1g --worker-memory 512m --worker-cores 1

It works fine. Then I realized that it is hard to submit the job from a web application. It looks like the spark-assembly-0.8.1-incubating-hadoop2.0.5-alpha.jar or spark-examples-assembly-0.8.1-incubating.jar is a really big jar. I believe it contains everything. So my questions are: 1) When I run the above script, which jar is being submitted to the yarn server? 2) It looks like the spark-assembly-0.8.1-incubating-hadoop2.0.5-alpha.jar plays the role of client side and spark-examples-assembly-0.8.1-incubating.jar goes with the spark runtime and examples which will be running in yarn, am I right? 3) Does anyone have any similar experience?
I did lots of hadoop MR stuff and want to follow the same logic to submit a spark job. For now I can only find the command line way to submit a spark job to yarn. I believe there is an easy way to integrate spark in a web application. Thanks. John.
rdd.saveAsTextFile problem
I have a very simple Spark application that looks like the following:

  var myRdd: RDD[Array[String]] = initMyRdd()
  println(myRdd.first.mkString(", "))
  println(myRdd.count)
  myRdd.saveAsTextFile("hdfs://myserver:8020/mydir")
  myRdd.saveAsTextFile("target/mydir/")

The println statements work as expected. The first saveAsTextFile statement also works as expected. The second saveAsTextFile statement does not (even if the first is commented out.) I get the exception pasted below. If I inspect target/mydir I see that there is a directory called _temporary/0/_temporary/attempt_201401020953__m_00_1 which contains an empty part-0 file. It's curious because this code worked before with Spark 0.8.0 and now I am running on Spark 0.8.1. I happen to be running this on Windows in local mode at the moment. Perhaps I should try running it on my linux box. Thanks, Philip

Exception in thread main org.apache.spark.SparkException: Job aborted: Task 2.0:0 failed more than 0 times; aborting job java.lang.NullPointerException
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:827)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:825)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:60)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:825)
at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:440)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$run(DAGScheduler.scala:502)
at org.apache.spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:157)
Re: rdd.saveAsTextFile problem
Not really. In practice I write everything out to HDFS and that is working fine. But I write lots of unit tests and example scripts, and it is convenient to be able to test a Spark application (or sequence of spark functions) in a very local way such that it doesn't depend on any outside infrastructure (e.g. an HDFS server.) So, it is convenient to write out a small amount of data locally and manually inspect the results - esp. as I'm building up a unit or regression test. So, ultimately writing results out to a local file isn't that important to me. However, I was just trying to run a simple example script that worked before and is now not working. Thanks, Philip On 1/2/2014 10:28 AM, Andrew Ash wrote: You want to write it to a local file on the machine? Try using file:///path/to/target/mydir/ instead. I'm not sure what the behavior would be if you did this on a multi-machine cluster though -- you may get a bit of data on each machine in that local directory. On Thu, Jan 2, 2014 at 12:22 PM, Philip Ogren philip.og...@oracle.com wrote: I have a very simple Spark application that looks like the following:

  var myRdd: RDD[Array[String]] = initMyRdd()
  println(myRdd.first.mkString(", "))
  println(myRdd.count)
  myRdd.saveAsTextFile("hdfs://myserver:8020/mydir")
  myRdd.saveAsTextFile("target/mydir/")

The println statements work as expected. The first saveAsTextFile statement also works as expected. The second saveAsTextFile statement does not (even if the first is commented out.) I get the exception pasted below. If I inspect target/mydir I see that there is a directory called _temporary/0/_temporary/attempt_201401020953__m_00_1 which contains an empty part-0 file. It's curious because this code worked before with Spark 0.8.0 and now I am running on Spark 0.8.1. I happen to be running this on Windows in local mode at the moment. Perhaps I should try running it on my linux box. Thanks, Philip

Exception in thread main org.apache.spark.SparkException: Job aborted: Task 2.0:0 failed more than 0 times; aborting job java.lang.NullPointerException
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:827)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:825)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:60)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:825)
at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:440)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$run(DAGScheduler.scala:502)
at org.apache.spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:157)
Re: rdd.saveAsTextFile problem
I just tried your suggestion and get the same results with the _temporary directory. Thanks though. On 1/2/2014 10:28 AM, Andrew Ash wrote: You want to write it to a local file on the machine? Try using file:///path/to/target/mydir/ instead. I'm not sure what the behavior would be if you did this on a multi-machine cluster though -- you may get a bit of data on each machine in that local directory. On Thu, Jan 2, 2014 at 12:22 PM, Philip Ogren philip.og...@oracle.com wrote: I have a very simple Spark application that looks like the following:

  var myRdd: RDD[Array[String]] = initMyRdd()
  println(myRdd.first.mkString(", "))
  println(myRdd.count)
  myRdd.saveAsTextFile("hdfs://myserver:8020/mydir")
  myRdd.saveAsTextFile("target/mydir/")

The println statements work as expected. The first saveAsTextFile statement also works as expected. The second saveAsTextFile statement does not (even if the first is commented out.) I get the exception pasted below. If I inspect target/mydir I see that there is a directory called _temporary/0/_temporary/attempt_201401020953__m_00_1 which contains an empty part-0 file. It's curious because this code worked before with Spark 0.8.0 and now I am running on Spark 0.8.1. I happen to be running this on Windows in local mode at the moment. Perhaps I should try running it on my linux box. Thanks, Philip

Exception in thread main org.apache.spark.SparkException: Job aborted: Task 2.0:0 failed more than 0 times; aborting job java.lang.NullPointerException
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:827)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:825)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:60)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:825)
at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:440)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$run(DAGScheduler.scala:502)
at org.apache.spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:157)
Re: rdd.saveAsTextFile problem
Yep - that works great and is what I normally do. I perhaps should have framed my email as a bug report. The documentation for saveAsTextFile says you can write results out to a local file, but it doesn't work for me per the described behavior. It also worked before and now it doesn't. So, it seems like a bug. Should I file a Jira issue? I haven't done that yet for this project but would be happy to. Thanks, Philip On 1/2/2014 11:23 AM, Andrew Ash wrote: For testing, maybe try using .collect and doing the comparison between expected and actual in memory rather than on disk? On Thu, Jan 2, 2014 at 12:54 PM, Philip Ogren philip.og...@oracle.com wrote: I just tried your suggestion and get the same results with the _temporary directory. Thanks though. On 1/2/2014 10:28 AM, Andrew Ash wrote: You want to write it to a local file on the machine? Try using file:///path/to/target/mydir/ instead. I'm not sure what the behavior would be if you did this on a multi-machine cluster though -- you may get a bit of data on each machine in that local directory. On Thu, Jan 2, 2014 at 12:22 PM, Philip Ogren philip.og...@oracle.com wrote: I have a very simple Spark application that looks like the following:

  var myRdd: RDD[Array[String]] = initMyRdd()
  println(myRdd.first.mkString(", "))
  println(myRdd.count)
  myRdd.saveAsTextFile("hdfs://myserver:8020/mydir")
  myRdd.saveAsTextFile("target/mydir/")

The println statements work as expected. The first saveAsTextFile statement also works as expected. The second saveAsTextFile statement does not (even if the first is commented out.) I get the exception pasted below. If I inspect target/mydir I see that there is a directory called _temporary/0/_temporary/attempt_201401020953__m_00_1 which contains an empty part-0 file. It's curious because this code worked before with Spark 0.8.0 and now I am running on Spark 0.8.1. I happen to be running this on Windows in local mode at the moment. Perhaps I should try running it on my linux box. Thanks, Philip

Exception in thread main org.apache.spark.SparkException: Job aborted: Task 2.0:0 failed more than 0 times; aborting job java.lang.NullPointerException
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:827)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:825)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:60)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:825)
at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:440)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$run(DAGScheduler.scala:502)
at org.apache.spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:157)
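Andrew's .collect suggestion as a sketch, using myRdd from the original message (the expected values here are made up):

  // Compare in memory; no filesystem, no _temporary directories.
  val expected = Seq("a,b", "c,d")
  val actual = myRdd.map(_.mkString(",")).collect().toSeq
  assert(actual == expected, "unexpected results: " + actual)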
Re: multi-line elements
Thank you for pointing me in the right direction! On 12/24/2013 2:39 PM, suman bharadwaj wrote: Just one correction, I think NLineInputFormat won't fit your usecase. I think you may have to write a custom record reader, use textinputformat, and plug it into spark as shown above. Regards, Suman Bharadwaj S On Wed, Dec 25, 2013 at 2:51 AM, suman bharadwaj suman@gmail.com wrote: Even I'm new to spark. But I was able to write a custom sentence input format and sentence record reader which reads multiple lines of text with the record boundary being [.?!]\s* using Hadoop APIs. And plugged the SentenceInputFormat into the spark api as shown below:

  val inputRead = sc.hadoopFile("path to the file in hdfs", classOf[SentenceTextInputFormat],
      classOf[LongWritable], classOf[Text]).map(value => value._2.toString)

In your case, you can use the NLineInputFormat i guess, which is provided by hadoop, and pass it as a parameter. Maybe there are better ways to do it. Regards, Suman Bharadwaj S On Wed, Dec 25, 2013 at 1:57 AM, Philip Ogren philip.og...@oracle.com wrote: I have a file that consists of multi-line records. Is it possible to read in multi-line records with a method such as SparkContext.newAPIHadoopFile? Or do I need to pre-process the data so that all the data for one element is in a single line? Thanks, Philip
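If the records are separated by a fixed delimiter rather than sentence boundaries, a custom record reader can sometimes be avoided entirely (a sketch; it assumes blank-line-separated records and a Hadoop version whose LineRecordReader honors textinputformat.record.delimiter):

  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.io.{LongWritable, Text}
  import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

  val conf = new Configuration(sc.hadoopConfiguration)
  conf.set("textinputformat.record.delimiter", "\n\n")  // records separated by blank lines
  val records = sc.newAPIHadoopFile("hdfs://path/to/file", classOf[TextInputFormat],
      classOf[LongWritable], classOf[Text], conf)
    .map(_._2.toString)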
Re: writing to HDFS with a given username
Well, it only uses my user name when I run my application in local mode (i.e. spark is running on my laptop with a master url of "local".) Not a general solution for you, I'm afraid! On 12/12/2013 5:38 PM, Koert Kuipers wrote: Hey Philip, how do you get spark to write to hdfs with your user name? When I use spark it writes to hdfs as the user that runs the spark services... I wish it read and wrote as me. On Thu, Dec 12, 2013 at 6:37 PM, Philip Ogren philip.og...@oracle.com wrote: When I call rdd.saveAsTextFile("hdfs://...") it uses my username to write to the HDFS drive. If I try to write to an HDFS directory that I do not have permissions to, then I get an error like this: Permission denied: user=me, access=WRITE, inode=/user/you/:you:us:drwxr-xr-x I can obviously get around this by changing the permissions on the directory /user/you. However, is it possible to call rdd.saveAsTextFile with an alternate username and password? Thanks, Philip
exposing spark through a web service
Hi Spark Community, I would like to expose my spark application/libraries via a web service in order to launch jobs, interact with users, etc. I'm sure there are hundreds of ways to think about doing this, each with a variety of technology stacks that could be applied. So, I know there is no right answer here. But I am curious to know if anyone has a fairly general solution that they are happy with and would recommend, or if there is any growing consensus in this community towards a solution or family of solutions that align with the Spark way of doing things. Any advice or thoughts are appreciated! Thanks, Philip
writing to HDFS with a given username
When I call rdd.saveAsTextFile("hdfs://...") it uses my username to write to the HDFS drive. If I try to write to an HDFS directory that I do not have permissions to, then I get an error like this: Permission denied: user=me, access=WRITE, inode=/user/you/:you:us:drwxr-xr-x I can obviously get around this by changing the permissions on the directory /user/you. However, is it possible to call rdd.saveAsTextFile with an alternate username and password? Thanks, Philip
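There is no username parameter on saveAsTextFile, but when HDFS runs without Kerberos it trusts whatever identity the client presents, so something like the following can work from the driver (a sketch only; on a real cluster the tasks still run as the executor's user, which matches the local-mode caveat in the reply earlier in this thread):

  import java.security.PrivilegedExceptionAction
  import org.apache.hadoop.security.UserGroupInformation

  // Impersonate "you" for the save (simple auth only, no Kerberos).
  val ugi = UserGroupInformation.createRemoteUser("you")
  ugi.doAs(new PrivilegedExceptionAction[Unit] {
    override def run(): Unit = rdd.saveAsTextFile("hdfs://myserver:8020/user/you/mydir")
  })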
Re: Fwd: Spark forum question
You might try a more standard windows path. I typically write to a local directory such as target/spark-output. On 12/11/2013 10:45 AM, Nathan Kronenfeld wrote: We are trying to test out running Spark 0.8.0 on a Windows box, and while we can get it to run all the examples that don't output results to disk, we can't get it to write output. Has anyone been able to write out to a local file on a single node windows install without using hdfs? Here is our test code:

  object FileWritingTest {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext("local[1]", "File Writing Test", null, null, null, null)
      // generate some work to do
      val res = sc.parallelize(Range(0, 10), 10).flatMap(p => "%d".format(p * 10))
      // save the results out to a file
      res.saveAsTextFile("file:///c:/somepath")
    }
  }

This works as expected using a unix based system. However, when trying to run on a windows cmd shell I get the following errors:

[WARN] 11 Dec 2013 12:00:33 - org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[INFO] 11 Dec 2013 12:00:33 - org.apache.spark.Logging$class - Saving as hadoop file of type (NullWritable, Text)
[INFO] 11 Dec 2013 12:00:33 - org.apache.spark.Logging$class - Starting job: saveAsTextFile at Test.scala:19
[INFO] 11 Dec 2013 12:00:33 - org.apache.spark.Logging$class - Got job 0 (saveAsTextFile at Test.scala:19) with 10 output partitions (allowLocal=false)
[INFO] 11 Dec 2013 12:00:33 - org.apache.spark.Logging$class - Final stage: Stage 0 (saveAsTextFile at Test.scala:19)
[INFO] 11 Dec 2013 12:00:33 - org.apache.spark.Logging$class - Parents of final stage: List()
[INFO] 11 Dec 2013 12:00:33 - org.apache.spark.Logging$class - Missing parents: List()
[INFO] 11 Dec 2013 12:00:33 - org.apache.spark.Logging$class - Submitting Stage 0 (MappedRDD[2] at saveAsTextFile at Test.scala:19), which has no missing parents
[INFO] 11 Dec 2013 12:00:33 - org.apache.spark.Logging$class - Submitting 10 missing tasks from Stage 0 (MappedRDD[2] at saveAsTextFile at Test.scala:19)
[INFO] 11 Dec 2013 12:00:33 - org.apache.spark.Logging$class - Size of task 0 is 5966 bytes
[INFO] 11 Dec 2013 12:00:33 - org.apache.spark.Logging$class - Running 0
[INFO] 11 Dec 2013 12:00:33 - org.apache.spark.Logging$class - Loss was due to org.apache.hadoop.util.Shell$ExitCodeException
org.apache.hadoop.util.Shell$ExitCodeException: chmod: getting attributes of `/cygdrive/c/somepath/_temporary/_attempt_201312111200__m_00_0/part-0': No such file or directory
at org.apache.hadoop.util.Shell.runCommand(Shell.java:261)
at org.apache.hadoop.util.Shell.run(Shell.java:188)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:381)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:467)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:450)
at org.apache.hadoop.fs.RawLocalFileSystem.execCommand(RawLocalFileSystem.java:593)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:584)
at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:427)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:465)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:433)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:886)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:781)
at org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:118)
at org.apache.hadoop.mapred.SparkHadoopWriter.open(SparkHadoopWriter.scala:86)
at org.apache.spark.rdd.PairRDDFunctions.writeToFile$1(PairRDDFunctions.scala:667)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$2.apply(PairRDDFunctions.scala:680)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$2.apply(PairRDDFunctions.scala:680)
at org.apache.spark.scheduler.ResultTask.run(ResultTask.scala:99)
at org.apache.spark.scheduler.local.LocalScheduler.runTask(LocalScheduler.scala:198)
at org.apache.spark.scheduler.local.LocalActor$$anonfun$launchTask$1$$anon$1.run(LocalScheduler.scala:68)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
[INFO] 11 Dec 2013 12:00:33 - org.apache.spark.Logging$class - Remove TaskSet 0.0 from pool
[INFO] 11 Dec 2013 12:00:33 - org.apache.spark.Logging$class - Failed to run saveAsTextFile at Test.scala:19
Exception in thread main
Re: Writing an RDD to Hive
Any chance you could sketch out the Shark APIs that you use for this? Matei's response suggests that the preferred API is coming in the next release (i.e. the RDDTable class in 0.8.1). Are you building Shark from the latest in the repo and using that? Or have you figured out other API calls that accomplish something similar? Thanks, Philip

On 12/8/2013 2:44 AM, Christopher Nguyen wrote: Philip, fwiw we do go with including Shark as a dependency for our needs, making a fat jar, and it works very well. It was quite a bit of pain what with the Hadoop/Hive transitive dependencies, but for us it was worth it. I hope that serves as an existence proof that says Mt Everest has been climbed, likely by more than just ourselves. Going forward this should be getting easier. -- Christopher T. Nguyen Co-founder CEO, Adatao http://adatao.com linkedin.com/in/ctnguyen
Writing an RDD to Hive
I have a simple scenario that I'm struggling to implement. I would like to take a fairly simple RDD generated from a large log file, perform some transformations on it, and write the results out such that I can perform a Hive query either from Hive (via Hue) or Shark. I'm having trouble with the last step. I am able to write my data out to HDFS and then execute a Hive create table statement followed by a load data statement as a separate step. I really dislike this separate manual step and would like to have it all accomplished in my Spark application. To this end, I have investigated two possible approaches as detailed below - it's probably too much information, so I'll ask my more basic question first: Does anyone have a basic recipe/approach for loading data in an RDD to a Hive table from a Spark application?

1) Load it into HBase via PairRDDFunctions.saveAsHadoopDataset. There is a nice detailed email on how to do this here: http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201311.mbox/%3ccacyzca3askwd-tujhqi1805bn7sctguaoruhd5xtxcsul1a...@mail.gmail.com%3E. I didn't get very far, though, because as soon as I added an hbase dependency (corresponding to the version of hbase we are running) to my pom.xml file, I had an slf4j dependency conflict that caused my current application to explode. I tried the latest released version and the slf4j dependency problem went away, but then the deprecated class TableOutputFormat no longer exists. Even if loading the data into hbase were trivially easy (and the detailed email suggests otherwise), I would then need to query HBase from Hive, which seems a little clunky.

2) So, I decided that Shark might be an easier option. All the examples provided in their documentation seem to assume that you are using Shark as an interactive application from a shell. Various threads I've seen seem to indicate that Shark isn't really intended to be used as a dependency in your Spark code (see this https://groups.google.com/forum/#%21topic/shark-users/DHhslaOGPLg/discussion and that https://groups.google.com/forum/#%21topic/shark-users/2_Ww1xlIgvo/discussion). It follows then that one can't add a Shark dependency to a pom.xml file because Shark isn't released via Maven Central (that I can tell; perhaps it's in some other repo?). Of course, there are ways of creating a local dependency in maven, but it starts to feel very hacky.

I realize that I've given sufficient detail to expose my ignorance in a myriad of ways. Please feel free to shine light on any of my misconceptions! Thanks, Philip
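For what it's worth, a minimal sketch of the two-step approach described above (write to HDFS, then issue the Hive statements over JDBC) driven entirely from the Spark application. The HiveServer2 host, the table schema, the output path, and transformedRdd (an RDD[List[String]]) are all hypothetical placeholders, and this assumes a Hive JDBC driver is on the classpath:

import java.sql.DriverManager

val outputPath = "hdfs:///user/philip/log-output" // hypothetical path
transformedRdd.map(_.mkString("\t")).saveAsTextFile(outputPath)

// then register/load the data in Hive over JDBC (hivehost:10000 is a placeholder)
Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection("jdbc:hive2://hivehost:10000/default", "user", "")
val stmt = conn.createStatement()
stmt.execute("CREATE TABLE IF NOT EXISTS log_table (col1 STRING, col2 STRING) " +
  "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'")
stmt.execute("LOAD DATA INPATH '" + outputPath + "' INTO TABLE log_table")
conn.close()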
Re: write data into HBase via spark
Hao, Thank you for the detailed response! (even if delayed!) I'm curious to know what version of hbase you added to your pom file. Thanks, Philip

On 11/14/2013 10:38 AM, Hao REN wrote: Hi, Philip. Basically, we need PairRDDFunctions.saveAsHadoopDataset to do the job; as HBase is not a filesystem, saveAsHadoopFile doesn't work.

def saveAsHadoopDataset(conf: JobConf): Unit

This function takes a JobConf parameter which should be configured. Essentially, you need to set the output format and the name of the output table.

// step 1: JobConf setup
// Note: the mapred package is used, instead of the mapreduce package which contains the new hadoop APIs.
import org.apache.hadoop.hbase.mapred.TableOutputFormat
import org.apache.hadoop.hbase.client._

// ... some other settings
val conf = HBaseConfiguration.create() // general hbase setting
conf.set("hbase.rootdir", "hdfs://" + nameNodeURL + ":" + hdfsPort + "/hbase")
conf.setBoolean("hbase.cluster.distributed", true)
conf.set("hbase.zookeeper.quorum", hostname)
conf.setInt("hbase.client.scanner.caching", 1)
// ... some other settings
val jobConfig: JobConf = new JobConf(conf, this.getClass)
// Note: TableOutputFormat is used as deprecated code, because JobConf is an old hadoop API
jobConfig.setOutputFormat(classOf[TableOutputFormat])
jobConfig.set(TableOutputFormat.OUTPUT_TABLE, outputTable)

// step 2: give your mapping
// The last thing to do is map your local data schema to the HBase one.
// Say our hbase schema is as below:
//   row | cf:col_1 | cf:col_2
// And in spark, you have an RDD of triples, like (1, 2, 3), (4, 5, 6), ...
// So you should map RDD[(Int, Int, Int)] to RDD[(ImmutableBytesWritable, Put)], where Put carries the mapping.
// You can define a function used by RDD.map, for example:
def convert(triple: (Int, Int, Int)) = {
  val p = new Put(Bytes.toBytes(triple._1))
  p.add(Bytes.toBytes("cf"), Bytes.toBytes("col_1"), Bytes.toBytes(triple._2))
  p.add(Bytes.toBytes("cf"), Bytes.toBytes("col_2"), Bytes.toBytes(triple._3))
  (new ImmutableBytesWritable, p)
}

// Suppose you have an RDD[(Int, Int, Int)] called localData; then writing data to hbase can be done by:
new PairRDDFunctions(localData.map(convert)).saveAsHadoopDataset(jobConfig)

Voilà. That's all you need. Hopefully, this simple example could help. Hao.

2013/11/13 Philip Ogren philip.og...@oracle.com: Hao, If you have worked out the code and turned it into an example that you can share, then please do! This task is in my queue of things to do, so any helpful details that you uncovered would be most appreciated. Thanks, Philip

On 11/13/2013 5:30 AM, Hao REN wrote: Ok, I worked it out. The following thread helps a lot. http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201310.mbox/%3C7B4868A9-B83E-4507-BB2A-2721FCE8E738%40gmail.com%3E Hao

2013/11/12 Hao REN julien19890...@gmail.com: Could someone show me a simple example of how to write data into HBase via Spark? I have checked the HbaseTest example; it's only for reading from HBase. Thank you. -- REN Hao Data Engineer @ ClaraVista Paris, France Tel: +33 06 14 54 57 24
Re: Writing to HBase
Here's a good place to start: http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201311.mbox/%3ccacyzca3askwd-tujhqi1805bn7sctguaoruhd5xtxcsul1a...@mail.gmail.com%3E

On 12/5/2013 10:18 AM, Benjamin Kim wrote: Does anyone have an example or some sort of starting-point code for writing from Spark Streaming into HBase? We currently stream ad server event log data using Flume-NG to tail log entries, collect them, and put them directly into an HBase table. We would like to do the same with Spark Streaming, but we would like to do some data massaging and simple data analysis beforehand. This will cut down the steps in prepping data and the number of tables for our data scientists and real-time feedback systems. Thanks, Ben
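To give the question above a concrete shape, here is an untested sketch of the foreachRDD pattern (called foreach on DStream in older releases), reusing the JobConf/Put recipe from the write-data-into-HBase thread below. The eventStream DStream, the events table, and the (rowKey, value) record shape are hypothetical:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapred.TableOutputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapred.JobConf
import org.apache.spark.SparkContext._ // for saveAsHadoopDataset on pair RDDs

val jobConfig = new JobConf(HBaseConfiguration.create())
jobConfig.setOutputFormat(classOf[TableOutputFormat])
jobConfig.set(TableOutputFormat.OUTPUT_TABLE, "events")

def toPut(event: (String, String)): (ImmutableBytesWritable, Put) = {
  val p = new Put(Bytes.toBytes(event._1)) // row key
  p.add(Bytes.toBytes("cf"), Bytes.toBytes("value"), Bytes.toBytes(event._2))
  (new ImmutableBytesWritable, p)
}

// write each micro-batch to HBase after any massaging/analysis
eventStream.foreachRDD(rdd => rdd.map(toPut).saveAsHadoopDataset(jobConfig))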
Re: write data into HBase via spark
Hao, If you have worked out the code and turned it into an example that you can share, then please do! This task is in my queue of things to do, so any helpful details that you uncovered would be most appreciated. Thanks, Philip

On 11/13/2013 5:30 AM, Hao REN wrote: Ok, I worked it out. The following thread helps a lot. http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201310.mbox/%3C7B4868A9-B83E-4507-BB2A-2721FCE8E738%40gmail.com%3E Hao

2013/11/12 Hao REN julien19890...@gmail.com: Could someone show me a simple example of how to write data into HBase via Spark? I have checked the HbaseTest example; it's only for reading from HBase. Thank you. -- REN Hao Data Engineer @ ClaraVista Paris, France Tel: +33 06 14 54 57 24
code review - splitting columns
Hi Spark community, I learned a lot the last time I posted some elementary Spark code here, so I thought I would do it again. Someone politely tell me offline if this is noise or an unfair use of the list! I acknowledge that this borders on asking Scala 101 questions. I have an RDD[List[String]] corresponding to columns of data, and I want to split one of the columns using some arbitrary function and return an RDD updated with the new columns. Here is the code I came up with:

def splitColumn(columnsRDD: RDD[List[String]], columnIndex: Int, numSplits: Int, splitFx: String => List[String]): RDD[List[String]] = {
  def insertColumns(columns: List[String]): List[String] = {
    val split = columns.splitAt(columnIndex)
    val left = split._1
    val splitColumn = split._2.head
    val splitColumns = splitFx(splitColumn).padTo(numSplits, "").take(numSplits)
    val right = split._2.tail
    left ++ splitColumns ++ right
  }
  columnsRDD.map(columns => insertColumns(columns))
}

Here is a simple test that demonstrates the behavior:

val spark = new SparkContext("local", "test spark")
val testStrings = List(List("1.2", "a b"), List("3.4", "c d e"), List("5.6", "f"))
var testRDD: RDD[List[String]] = spark.parallelize(testStrings)
testRDD = splitColumn(testRDD, 0, 2, _.split("\\.").toList)
testRDD = splitColumn(testRDD, 2, 2, _.split(" ").toList) // Line 5
val actualStrings = testRDD.collect.toList
assertEquals(4, actualStrings(0).length)
assertEquals("1, 2, a, b", actualStrings(0).mkString(", "))
assertEquals(4, actualStrings(1).length)
assertEquals("3, 4, c, d", actualStrings(1).mkString(", "))
assertEquals(4, actualStrings(2).length)
assertEquals("5, 6, f, ", actualStrings(2).mkString(", "))

My first concern about this code is that I'm missing out on something that does exactly this in the API. This seems like such a common use case that I would not be surprised if there's a readily available way to do this. I'm a little uncertain about the typing of splitColumn - i.e. the first parameter and the return value. It seems like a general solution wouldn't require every column to be a String value. I'm also annoyed that line 5 in the test code requires that I use an updated index to split what was originally the second column. This suggests that perhaps I should split all the columns that need splitting in one function call - but it seems like doing that would require an unwieldy function signature. Any advice or insight is appreciated! Thanks, Philip
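One possible way around the index-shifting annoyance at line 5 - offered as a sketch, not a claim that the API lacks something better - is to split several columns in one pass, keying the split functions by the original column indices. The splitColumns signature here is hypothetical:

def splitColumns(columnsRDD: RDD[List[String]], splits: Map[Int, (Int, String => List[String])]): RDD[List[String]] = {
  columnsRDD.map { columns =>
    columns.zipWithIndex.flatMap { case (value, i) =>
      splits.get(i) match {
        // pad/truncate exactly as splitColumn does above
        case Some((numSplits, splitFx)) => splitFx(value).padTo(numSplits, "").take(numSplits)
        case None => List(value)
      }
    }
  }
}

// both splits now refer to the original column indices 0 and 1
val splitRDD = splitColumns(testRDD, Map(
  0 -> (2, (s: String) => s.split("\\.").toList),
  1 -> (2, (s: String) => s.split(" ").toList)))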
code review - counting populated columns
Hi Spark coders, I wrote my first little Spark job that takes columnar data and counts up how many times each column is populated in an RDD. Here is the code I came up with:

// RDD of List[String] corresponding to tab-delimited values
val columns = spark.textFile("myfile.tsv").map(line => line.split("\t").toList)
// RDD of List[Int] corresponding to populated columns (1 for populated and 0 for not populated)
val populatedColumns = columns.map(row => row.map(column => if (column.length > 0) 1 else 0))
// List[Int] containing the sums of the 1's in each column
val counts = populatedColumns.reduce((row1, row2) => (row1, row2).zipped.map(_ + _))

Any thoughts about the fitness of this code snippet? I'm a little annoyed by creating an RDD full of 1's and 0's in the second line. The if statement feels awkward too. I was happy to find the zipped method for the reduce step. Any feedback you might have on how to improve this code is appreciated. I'm a newbie to both Scala and Spark. Thanks, Philip
Re: code review - counting populated columns
Where does 'emit' come from? I don't see it in the Scala or Spark apidocs (though I don't feel very deft at searching either!) Thanks, Philip

On 11/8/2013 2:23 PM, Patrick Wendell wrote: It would be a bit more straightforward to write it like this:

val columns = [same as before]
val counts = columns.flatMap(emit (col_id, 0 or 1) for each column).reduceByKey(_ + _)

Basically, look at each row and emit several records using flatMap. Each record has an ID for the column (maybe its index) and a flag for whether it's present. Then you reduce by key to get the per-column count. Then you can collect at the end. - Patrick
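For what it's worth, a concrete (untested) rendering of Patrick's pseudocode: 'emit' is not an API call, just shorthand for the records produced inside flatMap. Note that reduceByKey requires import org.apache.spark.SparkContext._ in a standalone program:

val counts = columns.flatMap { row =>
  // emit one (columnIndex, 0 or 1) record per column of the row
  row.zipWithIndex.map { case (col, i) => (i, if (col.length > 0) 1 else 0) }
}.reduceByKey(_ + _)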
Re: code review - counting populated columns
Thank you for the pointers. I'm not sure I was able to fully understand either of your suggestions, but here is what I came up with. I started with Tom's code, but I think I ended up borrowing from Patrick's suggestion too. Any thoughts about my updated solution are more than welcome! I added local variable types for clarity.

def countPopulatedColumns(tsv: RDD[String]): RDD[(Int, Int)] = {
  // split by tab and zip with index to give (column value, column index) pairs
  val sparse: RDD[(String, Int)] = tsv.flatMap(line => line.split("\t").zipWithIndex)
  // filter out all the zero-length values
  val dense: RDD[(String, Int)] = sparse.filter(valueIndex => valueIndex._1.length > 0)
  // map each column index to one and do the usual reduction
  dense.map(valueIndex => (valueIndex._2, 1)).reduceByKey(_ + _)
}

Of course, this can be condensed to a single line, but it doesn't seem as easy to read as the more verbose code above. Write-once code like the following is why I never liked Perl:

def cpc(tsv: RDD[String]): RDD[(Int, Int)] = {
  tsv.flatMap(_.split("\t").zipWithIndex).filter(ci => ci._1.length > 0).map(ci => (ci._2, 1)).reduceByKey(_ + _)
}

Thanks, Philip

On 11/8/2013 2:41 PM, Patrick Wendell wrote: Hey Tom, reduceByKey will reduce locally on all the nodes, so there won't be any data movement except to combine totals at the end. - Patrick

On Fri, Nov 8, 2013 at 1:35 PM, Tom Vacek minnesota...@gmail.com wrote: Your example requires each row to be exactly the same length, since zipped will truncate to the shorter of its two arguments. The second solution is elegant, but reduceByKey involves flying a bunch of data around to sort the keys. I suspect it would be a lot slower. But you could save yourself from adding up a bunch of zeros:

val sparseRows = spark.textFile("myfile.tsv").map(line => line.split("\t").zipWithIndex.filter(_._1.length > 0))
sparseRows.reduce(mergeAdd(_, _))

You'll have to write a mergeAdd function. This might not be any faster, but it does allow variable-length rows.
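Tom's mergeAdd is left as an exercise above; here is one possible shape for it, sketched with each row represented as a map from column index to count (a slight variation on his arrays of pairs):

val sparseRows = spark.textFile("myfile.tsv").map(line =>
  line.split("\t").zipWithIndex.filter(_._1.length > 0).map(ci => (ci._2, 1)).toMap)

// merge two per-row count maps, summing counts for shared column indices
def mergeAdd(a: Map[Int, Int], b: Map[Int, Int]): Map[Int, Int] =
  (a.keySet ++ b.keySet).map(k => (k, a.getOrElse(k, 0) + b.getOrElse(k, 0))).toMap

val counts: Map[Int, Int] = sparseRows.reduce(mergeAdd)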
Where is reduceByKey?
On the front page http://spark.incubator.apache.org/ of the Spark website there is the following simple word count implementation:

file = spark.textFile("hdfs://...")
file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)

The same code can be found in the Quick Start http://spark.incubator.apache.org/docs/latest/quick-start.html guide. When I follow the steps in my spark-shell (version 0.8.0) it works fine. The reduceByKey method is also shown in the list of transformations http://spark.incubator.apache.org/docs/latest/scala-programming-guide.html#transformations in the Spark Programming Guide. The bottom of this list directs the reader to the API docs for the class RDD (this link is broken, BTW). The API docs for RDD http://spark.incubator.apache.org/docs/latest/api/core/index.html#org.apache.spark.rdd.RDD do not list a reduceByKey method for RDD. Also, when I try to compile the above code in a Scala class definition I get the following compile error:

value reduceByKey is not a member of org.apache.spark.rdd.RDD[(java.lang.String, Int)]

I am compiling with maven using the following dependency definition:

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.9.3</artifactId>
  <version>0.8.0-incubating</version>
</dependency>

Can someone help me understand why this code works fine from the spark-shell but doesn't seem to exist in the API docs and won't compile? Thanks, Philip
Re: Where is reduceByKey?
Thanks - I think this would be a helpful note to add to the docs. I went and read a few things about Scala implicit conversions (I'm obviously new to the language) and it seems like a very powerful language feature; now that I know about them it will certainly be easy to identify when they are missing (i.e. the first thing to suspect when you see a not a member compilation message.) I'm still a bit mystified as to how you would go about finding the appropriate imports, except that I suppose you aren't very likely to use methods that you don't already know about! Unless you are copying code verbatim that doesn't have the necessary import statements.

On 11/7/2013 4:05 PM, Matei Zaharia wrote: Yeah, this is confusing, and unfortunately as far as I know it's API specific. Maybe we should add this to the documentation page for RDD. The reason for these conversions is to only allow some operations based on the underlying data type of the collection. For example, Scala collections support sum() as long as they contain numeric types. That's fine for the Scala collection library since its conversions are imported by default, but I guess it makes it confusing for third-party apps. Matei

On Nov 7, 2013, at 1:15 PM, Philip Ogren philip.og...@oracle.com wrote: I remember running into something very similar when trying to perform a foreach on a java.util.List, and I fixed it by adding the following import:

import scala.collection.JavaConversions._

And my foreach loop magically compiled - presumably due to another implicit conversion. Now this is the second time I've run into this problem and I didn't recognize it. I'm not sure that I would know what to do the next time I run into this. Do you have some advice on how I should have recognized a missing import that provides implicit conversions and how I would know what to import? This strikes me as code obfuscation. I guess this is more of a Scala question. Thanks, Philip

On 11/7/2013 2:01 PM, Josh Rosen wrote: The additional methods on RDDs of pairs are defined in a class called PairRDDFunctions (https://spark.incubator.apache.org/docs/latest/api/core/index.html#org.apache.spark.rdd.PairRDDFunctions). SparkContext provides an implicit conversion from RDD[T] to PairRDDFunctions[T] to make this transparent to users. To import those implicit conversions, use:

import org.apache.spark.SparkContext._

These conversions are automatically imported by Spark Shell, but you'll have to import them yourself in standalone programs.
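To make the resolution above concrete, the front-page example compiles in a standalone program once the implicit conversions are imported; a minimal sketch:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._ // brings pair-RDD methods such as reduceByKey into scope

val spark = new SparkContext("local", "word count")
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)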
compare/contrast Spark with Cascading
My team is investigating a number of technologies in the Big Data space. A team member recently got turned on to Cascading http://www.cascading.org/about-cascading/ as an application layer for orchestrating complex workflows/scenarios. He asked me whether Spark has an application layer. My initial reaction is no: Spark would not have a separate orchestration/application layer. Instead, the core Spark API (along with Streaming) would compete directly with Cascading for this kind of functionality, and the two would not likely be all that complementary. I realize that I am exposing my ignorance here and could be way off. Is there anyone who knows a bit about both of these technologies who could speak to this in broad strokes? Thanks! Philip
Re: set up spark in eclipse
Hi Arun, I had recent success getting a Spark project set up in Eclipse Juno. Here are the notes that I wrote down for the rest of my team that you may find useful: Spark version 0.8.0 requires Scala version 2.9.3. This is a bit inconvenient because Scala is now on version 2.10.3 and the latest and greatest versions of the Scala IDE plugin for Eclipse work with the latest versions of Scala. Getting a Scala plugin for Eclipse for version 2.9.3 requires a little bit of effort (but not too much). The following link has information about using Eclipse with Scala 2.9.3: http://scala-ide.org/download/current.html. Release 3.0.0 supports Scala 2.9.3 and there are two update sites (in green boxes). One is for Eclipse Juno: http://download.scala-ide.org/sdk/e38/scala29/stable/site. So, download Eclipse Juno SR2 64-bit for Windows from eclipse.org and add the Scala IDE using the above update site. If you can't get the update site to work, then you may need to download a zip file that is a downloadable update site from here: http://download.scala-ide.org/sdk/e38/scala29/stable/update-site.zip Philip

On 10/26/2013 11:20 PM, Arun Kumar wrote: Hi, I was trying to set up a spark project in eclipse. I used the sbt way to create a simple spark project as described in the documentation [1]. Then I used the sbteclipse plugin [2] to create the eclipse project. But I am getting errors when I import the project in eclipse. I seem to be getting them because the project is using Scala version 2.10.1 and there is no easy way to use Scala 2.9.2. Was anybody else more successful in setting this up? Are there any other IDEs I could use? Can maven be used to create the scala/spark project and import it into eclipse? 1. http://spark.incubator.apache.org/docs/latest/quick-start.html#a-standalone-app-in-scala 2. https://github.com/typesafehub/sbteclipse Thanks
Re: unable to serialize analytics pipeline
A simple workaround that seems to work (at least in localhost mode) is to mark my top-level pipeline object (inside my simple interface) as transient and add an initialize method. In the method that calls the pipeline and returns the results, I simply call the initialize method if needed (i.e. if the pipeline object is null). This seems reasonable to me. I will try it on an actual cluster next. Thanks, Philip

On 10/22/2013 11:50 AM, Philip Ogren wrote: I have a text analytics pipeline that performs a sequence of steps (e.g. tokenization, part-of-speech tagging, etc.) on a line of text. I have wrapped the whole pipeline up into a simple interface that allows me to call it from Scala as a POJO - i.e. I instantiate the pipeline, I pass it a string, and get back some objects. Now, I would like to do the same thing for items in a Spark RDD via a map transformation. Unfortunately, my pipeline is not serializable, so I get a NotSerializableException when I try this. I played around with Kryo just now to see if that could help, and I ended up with a missing no-arg constructor exception on a class I have no control over. It seems the Spark framework expects that I should be able to serialize my pipeline when I can't (or at least don't think I can at first glance). Is there a workaround for this scenario? I am imagining a few possible solutions that seem a bit dubious to me, so I thought I would ask for direction before wandering about. Perhaps a better understanding of serialization strategies might help me get the pipeline to serialize. Or perhaps there is a way to instantiate my pipeline on demand on the nodes through a factory call. Any advice is appreciated. Thanks, Philip
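A sketch of the transient-plus-initialize workaround described above; Pipeline, its annotate method, and textRDD are hypothetical stand-ins for the real analytics classes:

// stand-in for the real, non-serializable pipeline
class Pipeline { def annotate(text: String): List[String] = text.split(" ").toList }

class PipelineWrapper extends Serializable {
  @transient private var pipeline: Pipeline = null

  // lazily rebuild the non-serializable pipeline on whichever node runs the task
  private def initializeIfNeeded() { if (pipeline == null) pipeline = new Pipeline() }

  def annotate(text: String): List[String] = {
    initializeIfNeeded()
    pipeline.annotate(text)
  }
}

val wrapper = new PipelineWrapper() // serializes without the pipeline field
val annotated = textRDD.map(line => wrapper.annotate(line))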
[jira] [Commented] (DERBY-4921) Statement.executeUpdate(String sql, String[] columnNames) throws ERROR X0X0F.S exception with EmbeddedDriver
[ https://issues.apache.org/jira/browse/DERBY-4921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13570302#comment-13570302 ] Philip Ogren commented on DERBY-4921: - If you change the client driver to behave as the embedded driver as you suggest above, then it will not be possible for my example code to work with both the Derby driver and the PostgreSQL driver without adding a driver-specific conditional. Please note Knut's observation about PostgreSQL above. It would be much preferable to me to see the embedded driver behave like the client driver because that will be much less likely to break other people's code (including mine.)

Statement.executeUpdate(String sql, String[] columnNames) throws ERROR X0X0F.S exception with EmbeddedDriver -- Key: DERBY-4921 URL: https://issues.apache.org/jira/browse/DERBY-4921 Project: Derby Issue Type: Bug Components: JDBC Affects Versions: 10.6.2.1 Reporter: Jarek Przygódzki

Statement.executeUpdate(insertSql, int[] columnIndexes) and Statement.executeUpdate(insertSql, Statement.RETURN_GENERATED_KEYS) do work; Statement.executeUpdate(String sql, String[] columnNames) doesn't. Test program:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class GetGeneratedKeysTest {
  static String createTableSql = "CREATE TABLE tbl (id integer primary key generated always as identity, name varchar(200))";
  static String insertSql = "INSERT INTO tbl(name) values('value')";
  static String driver = "org.apache.derby.jdbc.EmbeddedDriver";
  static String[] idColName = { "id" };

  public static void main(String[] args) throws Exception {
    Class.forName(driver);
    Connection conn = DriverManager.getConnection("jdbc:derby:testDb;create=true");
    conn.setAutoCommit(false);
    Statement stmt = conn.createStatement();
    ResultSet rs;
    stmt.executeUpdate(createTableSql);
    stmt.executeUpdate(insertSql, idColName);
    rs = stmt.getGeneratedKeys();
    if (rs.next()) {
      int id = rs.getInt(1);
    }
    conn.commit();
  }
}

Result: Exception in thread main java.sql.SQLException: Table 'TBL' does not have an auto-generated column named 'id'. at org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(SQLExceptionFactory40.java:95) at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Util.java:256) at org.apache.derby.impl.jdbc.TransactionResourceImpl.wrapInSQLException(TransactionResourceImpl.java:391) at org.apache.derby.impl.jdbc.TransactionResourceImpl.handleException(TransactionResourceImpl.java:346) at org.apache.derby.impl.jdbc.EmbedConnection.handleException(EmbedConnection.java:2269) at org.apache.derby.impl.jdbc.ConnectionChild.handleException(ConnectionChild.java:81) at org.apache.derby.impl.jdbc.EmbedStatement.executeStatement(EmbedStatement.java:1321) at org.apache.derby.impl.jdbc.EmbedStatement.execute(EmbedStatement.java:625) at org.apache.derby.impl.jdbc.EmbedStatement.executeUpdate(EmbedStatement.java:246) at GetGeneratedKeysTest.main(GetGeneratedKeysTest.java:23) Caused by: java.sql.SQLException: Table 'TBL' does not have an auto-generated column named 'id'. at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(SQLExceptionFactory.java:45) at org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(SQLExceptionFactory40.java:119) at org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(SQLExceptionFactory40.java:70) ... 9 more Caused by: ERROR X0X0F: Table 'TBL' does not have an auto-generated column named 'id'. 
at org.apache.derby.iapi.error.StandardException.newException(StandardException.java:303) at org.apache.derby.impl.sql.execute.InsertResultSet.verifyAutoGeneratedColumnsNames(InsertResultSet.java:689) at org.apache.derby.impl.sql.execute.InsertResultSet.open(InsertResultSet.java:419) at org.apache.derby.impl.sql.GenericPreparedStatement.executeStmt(GenericPreparedStatement.java:436) at org.apache.derby.impl.sql.GenericPreparedStatement.execute(GenericPreparedStatement.java:317) at org.apache.derby.impl.jdbc.EmbedStatement.executeStatement(EmbedStatement.java:1232) ... 3 more -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see
[jira] [Commented] (DERBY-4921) Statement.executeUpdate(String sql, String[] columnNames) throws ERROR X0X0F.S exception with EmbeddedDriver
[ https://issues.apache.org/jira/browse/DERBY-4921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13569193#comment-13569193 ] Philip Ogren commented on DERBY-4921: - I would like to challenge the decision to close this issue. I was really confused about why I was seeing this behavior today, even after finding and reading through this issue, because I had never seen it before. I finally realized that it was because the behavior of the EmbeddedDriver and the ClientDriver differs on this exact point, and I had always used the client driver in the code in question. I think it is one thing to throw up your hands because the spec is vague (and even still, it seems obvious to me that this is a bug), but I think it is not really defensible to have the two drivers behave differently. Please reopen and fix!

Statement.executeUpdate(String sql, String[] columnNames) throws ERROR X0X0F.S exception with EmbeddedDriver -- Key: DERBY-4921 URL: https://issues.apache.org/jira/browse/DERBY-4921 Project: Derby Issue Type: Bug Components: JDBC Affects Versions: 10.6.2.1 Reporter: Jarek Przygódzki

Statement.executeUpdate(insertSql, int[] columnIndexes) and Statement.executeUpdate(insertSql, Statement.RETURN_GENERATED_KEYS) do work; Statement.executeUpdate(String sql, String[] columnNames) doesn't.
[jira] [Commented] (DERBY-4921) Statement.executeUpdate(String sql, String[] columnNames) throws ERROR X0X0F.S exception with EmbeddedDriver
[ https://issues.apache.org/jira/browse/DERBY-4921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13569197#comment-13569197 ] Philip Ogren commented on DERBY-4921: - I prepared some code that demonstrates the behavior, which I ran on the latest driver (10.9.1.0). I don't see a way to attach it so I will just paste it here. Hopefully it will render ok.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

/**
 * To run testClientDriver, first start the Derby server with something like:
 * java -jar derbyrun.jar server start
 *
 * To run testEmbeddedDriver, change the local variable 'location' to some directory that exists on your system.
 */
public class DerbyTest {

  public static final String CREATE_TABLE_SQL = "create table test_table (id int generated always as identity primary key, name varchar(100))";
  public static final String INSERT_SQL = "insert into test_table (name) values (?)";

  public static void testClientDriver() throws Exception {
    String driver = "org.apache.derby.jdbc.ClientDriver";
    Class.forName(driver).newInstance();
    Connection connection = DriverManager.getConnection("jdbc:derby://localhost:1527/testdb;create=true");
    connection.prepareStatement("drop table test_table").execute();
    connection.prepareStatement(CREATE_TABLE_SQL).execute();
    for (String columnName : new String[] { "ID", "id" }) {
      PreparedStatement statement = connection.prepareStatement(INSERT_SQL, new String[] { columnName });
      statement.setString(1, "my name");
      statement.executeUpdate();
      ResultSet results = statement.getGeneratedKeys();
      results.next();
      int id = results.getInt(1);
      System.out.println(id);
    }
  }

  public static void testEmbeddedDriver() throws Exception {
    String driver = "org.apache.derby.jdbc.EmbeddedDriver";
    Class.forName(driver).newInstance();
    String location = "D:\\temp";
    Connection connection = DriverManager.getConnection("jdbc:derby:" + location + "\\testdb;create=true");
    connection.prepareStatement("drop table test_table").execute();
    connection.prepareStatement(CREATE_TABLE_SQL).execute();
    for (String columnName : new String[] { "ID", "id" }) {
      PreparedStatement statement = connection.prepareStatement(INSERT_SQL, new String[] { columnName });
      statement.setString(1, "my name");
      statement.executeUpdate();
      ResultSet results = statement.getGeneratedKeys();
      results.next();
      int id = results.getInt(1);
      System.out.println("column name=" + columnName + ": " + id);
    }
  }

  public static void main(String[] args) throws Exception {
    testClientDriver();
    testEmbeddedDriver();
  }
}

Statement.executeUpdate(String sql, String[] columnNames) throws ERROR X0X0F.S exception with EmbeddedDriver -- Key: DERBY-4921 URL: https://issues.apache.org/jira/browse/DERBY-4921 Project: Derby Issue Type: Bug Components: JDBC Affects Versions: 10.6.2.1 Reporter: Jarek Przygódzki

Statement.executeUpdate(insertSql, int[] columnIndexes) and Statement.executeUpdate(insertSql, Statement.RETURN_GENERATED_KEYS) do work; Statement.executeUpdate(String sql, String[] columnNames) doesn't.
[jira] Commented: (UIMA-1983) JCasGen prouces source files with name shadowing/conflicts
[ https://issues.apache.org/jira/browse/UIMA-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12977607#action_12977607 ] Philip Ogren commented on UIMA-1983: - If I define a type called MyAnnotation in a type system descriptor file, then MyAnnotation.java and MyAnnotation_Type.java can be generated. The following can be compiler warnings in Java 1.5:

MyAnnotation.java:
- The field MyAnnotation.typeIndexID is hiding a field from type Annotation
- The field MyAnnotation.type is hiding a field from type Annotation
Both of these can be ignored with @SuppressWarnings("hiding").

MyAnnotation_Type.java:
- The field MyAnnotation_Type.featOkTst is hiding a field from type Annotation_Type
- The field MyAnnotation_Type.typeIndexID is hiding a field from type Annotation_Type
Both of these can be ignored with @SuppressWarnings("hiding").

When compiling with Java 1.6, I get the following additional warnings:

org.uimafit.type.MyAnnotation.readObject() - Empty block should be documented. This one can be fixed by simply adding a comment to the code block, such as /* generated code */.

org.uimafit.type.AnalyzedText.getTypeIndexID() - The method getTypeIndexID() of type AnalyzedText should be tagged with @Override since it actually overrides a superclass method. This one can be fixed by adding the @Override annotation.

org.uimafit.type.AnalyzedText_Type.getFSGenerator() - The method getFSGenerator() of type AnalyzedText_Type should be tagged with @Override since it actually overrides a superclass method. This one can be fixed by adding the @Override annotation.

It may be possible that there are other compiler warnings - but this is what I see with my current configuration, which is fairly standard, I think. Thanks.

JCasGen prouces source files with name shadowing/conflicts -- Key: UIMA-1983 URL: https://issues.apache.org/jira/browse/UIMA-1983 Project: UIMA Issue Type: Improvement Components: Core Java Framework Reporter: Philip Ogren Priority: Minor -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (UIMA-1983) JCasGen prouces source files with name shadowing/conflicts
JCasGen prouces source files with name shadowing/conflicts -- Key: UIMA-1983 URL: https://issues.apache.org/jira/browse/UIMA-1983 Project: UIMA Issue Type: Improvement Components: Core Java Framework Reporter: Philip Ogren Priority: Minor

When the compiler warnings are set to complain when name shadowing or name conflicts exist, the source files produced by JCasGen contain many warnings. It sure would be nice if these files came out pristine rather than having compiler warnings. Eclipse does not seem to allow for fine-grained compiler warning configuration (i.e. to ignore certain warnings for certain source folders or packages) but only works at the project level. Therefore, I must either turn these warnings off for the entire project or ignore the warnings in the type system java files. I'm guessing that this is a side effect of an intentional design decision (re e.g. typeIndexID) and so I am not that hopeful that this can be fixed, but I thought I would ask anyway. Thanks, Philip -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
CAS Editor as a stand-alone application?
I am wondering if it is possible to run the CAS Editor as a stand-alone application or if it is only available as a plugin within Eclipse. Thanks, Philip
[jira] Commented: (UIMA-1875) ability to visualize and quickly update/add values to primitive features
[ https://issues.apache.org/jira/browse/UIMA-1875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12936054#action_12936054 ] Philip Ogren commented on UIMA-1875: - This looks really promising - thanks for investigating! One very simple thing to try would be to center the tags under the annotations. Additionally, you could truncate the tags in cases where they still overlap. Maybe it would be possible to see the full tag when the mouse hovers over it. This otherwise looks really nice, btw. Even as it is, I would probably be willing to use it now for pos tagging.

ability to visualize and quickly update/add values to primitive features Key: UIMA-1875 URL: https://issues.apache.org/jira/browse/UIMA-1875 Project: UIMA Issue Type: New Feature Components: CasEditor Reporter: Philip Ogren Assignee: Jörn Kottmann Attachments: CasEditor-TagDrawingStrategy.tiff, CasEditor-TagDrawingStrategyOverlap.tiff

I spent a bit of time evaluating the CAS Editor recently and have the following suggestion. It is common to have annotation tasks in which adding a primitive value to an annotation feature happens frequently. Here's one common annotation task - part-of-speech tagging. Usually, the way this task is performed is that a part-of-speech tagger is run on some data and a part-of-speech tag is added as a string value to a feature of a token type. The annotator's task is then to look at the part-of-speech tags, make sure they look right, and fix the ones that aren't. However, the only way to see the part-of-speech tag is by clicking on the token annotation in the text and viewing the value of the feature in the editor view. This makes the tool really unusable for this annotation task. What would be really nice is to be able to display the part-of-speech tags above or below the tokens so that the linguist can scan the sentence with its tags and quickly find the errors. There are a number of other annotation tasks that have similar requirements. For example, named entities usually have category labels which would be nice to display. Word sense disambiguation data is also similar. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: Training and Learning
Borobudur, Do you mean training in the machine learning sense? If so, UIMA does not directly support any notion of training and using statistical classifiers. You might check out ClearTK http://cleartk.googlecode.com which is a UIMA-based project that provides support for a number of machine learning classifiers and supports feature extraction, training, and classification. Philip

On 11/12/2010 7:39 AM, borobudur wrote: Hi, I had a look at the UIMA architecture and I was asking myself if there is a concept for a training and learning phase in UIMA. What I learned is that UIMA chains Analysis Engines together. There is no two-phase concept like training and extraction in UIMA, is there? Thanks, borobudur
Re: AW: Compare two CASes
Hi Armin, I am glad you have found uimaFIT useful. I'm sorry about the lack of information about the dependency. It is documented in some sense in the pom.xml file - but that is not too helpful if you are not building with maven and you simply download the .jar file available from the downloads page. Thank you for bringing this to my attention. Philip

On 11/4/2010 11:35 PM, armin.weg...@bka.bund.de wrote: Hi Philip, thanks, that really helps. Looking into the uimaFIT source code showed that UIMA is missing some documentation that would be really useful. I think I will use uimaFIT from now on. uimaFIT should be part of UIMA itself. But you could have mentioned somewhere that uimaFIT depends on Apache Commons. Thanks, Armin

-----Original Message----- From: Philip Ogren Sent: Monday, November 1, 2010 16:20 To: user@uima.apache.org Subject: Re: Compare two CASes

Hi Armin, I have put together some example code using uimaFIT to address this very common scenario. There's a wiki page that provides an entry point here: http://code.google.com/p/uimafit/wiki/RunningExperiments Hope this helps. Philip

On 11/1/2010 2:09 AM, armin.weg...@bka.bund.de wrote: Hi, how do I compare two CASes? I would like to compute the f-value for some annotations of an actual CAS compared to a target CAS. To compare two CASes I need them both in one analysis engine, but an analysis engine processes only one CAS at a time. Do I have to merge the two CASes first to get a CAS with a target view and an actual view? How do I merge two CASes? Or is an analysis engine the wrong place to do this? Thanks, Armin
Re: Compare two CASes
Hi Armin, I have put together some example code using uimaFIT to address this very common scenario. There's a wiki page that provides an entry point here: http://code.google.com/p/uimafit/wiki/RunningExperiments Hope this helps. Philip

On 11/1/2010 2:09 AM, armin.weg...@bka.bund.de wrote: Hi, how do I compare two CASes? I would like to compute the f-value for some annotations of an actual CAS compared to a target CAS. To compare two CASes I need them both in one analysis engine, but an analysis engine processes only one CAS at a time. Do I have to merge the two CASes first to get a CAS with a target view and an actual view? How do I merge two CASes? Or is an analysis engine the wrong place to do this? Thanks, Armin
Re: maven src directory for generated code
Marshall, Thank you for this helpful hint. I was just now following up on Richard's tips and was trying to figure out how to get Eclipse to recognize the new source directory. I think I now have a working solution! I will send it around shortly. Philip

On 10/21/2010 9:03 AM, Marshall Schor wrote: Hi, I agree - lots of source generation tooling generates to target/generated-sources ... This has the advantage that the normal clean operation removes these. Using the m2eclipse Eclipse plugin for maven - by default it will miss these generated directories the first time you import a project as a Maven project. However, the recovery is simple and only needs doing once: right-click the project and select Maven > Update Project Configuration. -Marshall

On 10/21/2010 2:30 AM, Richard Matthias Eckart de Castilho wrote: Hello Philip, So, I would much rather have the generated code put into a different source folder that has only generated code in it. Something like src/generated/java or src/output/java or whatever the convention is. My question is whether or not there is a convention and if so could someone point me to a project that does something like this? My understanding is that target/generated-sources/toolname is appropriate - I've seen this in a couple of instances. This is also what is mentioned in the NetBeans wiki on Maven best practices: If your project contains generated source roots that need to appear in the project's source path, please make sure that the Maven plugin generating the sources generates them in the target/generated-sources/toolname directory, where toolname is a folder specific to the Maven plugin used and acts as a source root for the generated sources. Most common maven plugins currently follow this pattern in the default configuration. (Source: http://wiki.netbeans.org/MavenBestPractices#Open_existing_project) There is also a Maven plugin which allows you to add folders to the Maven source folders list. This may help if e.g. Eclipse does not properly pick up the folder: http://mojo.codehaus.org/build-helper-maven-plugin/usage.html Cheers, Richard
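For completeness, a sketch of the build-helper-maven-plugin configuration from the link above, registering a generated-sources folder as a source root (the jcasgen path and the omitted plugin version are illustrative):

<plugin>
  <groupId>org.codehaus.mojo</groupId>
  <artifactId>build-helper-maven-plugin</artifactId>
  <executions>
    <execution>
      <id>add-source</id>
      <phase>generate-sources</phase>
      <goals>
        <goal>add-source</goal>
      </goals>
      <configuration>
        <sources>
          <source>target/generated-sources/jcasgen</source>
        </sources>
      </configuration>
    </execution>
  </executions>
</plugin>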
maven src directory for generated code
We are using maven as the build platform for our uima-based projects. As part of the build, org.apache.uima.tools.jcasgen.Jg is invoked to generate java files for the type system. This is performed in the process-resources phase using a plugin configuration previously discussed on this list. We have the target directory for the generated java files set to src/main/java. However, we make it our practice not to check these files into our repository because they change every time they get generated - and because they are generated, there's no point in checking them in in the first place. As the type system gets bigger it gets more and more annoying to have src/main/java filled with directories that are under svn:ignore. This is particularly true when a sub-project imports a type system from a different project and the packaging structure of the generated source code is much different from the sub-project's code. So, I would much rather have the generated code put into a different source folder that has only generated code in it. Something like src/generated/java or src/output/java or whatever the convention is. My question is whether there is a convention and, if so, could someone point me to a project that does something like this? Thanks, Philip
uimaFIT 1.0.0 released
We are pleased to announce the release of uimaFIT 1.0.0 - a library that provides factories, injection, and testing utilities for UIMA. The following list highlights some of the features uimaFIT provides: * Factories: simplify instantiating UIMA components programmatically without descriptor files. For example, to instantiate an AnalysisEngine a call like this could be made: AnalysisEngineFactory.createPrimitive(MyAEImpl.class, myTypeSystem, "paramName", paramValue) * Injection: handles the binding of configuration parameter values to the corresponding member variables in the analysis engines and handles the binding of external resources. For example, to bind a configuration parameter just annotate a member variable with @ConfigurationParameter. Then add one line of code to your initialize method - ConfigurationParameterInitializer.initialize(this, uimaContext). This is handled automatically if you extend the uimaFIT JCasAnnotator_ImplBase class. * Testing: uimaFIT simplifies testing in a number of ways described in the documentation. By making it easy to instantiate your components without descriptor files, a large amount of difficult-to-maintain and unnecessary XML can be eliminated from your test code. This makes tests easier to write and maintain. Also, running components as a pipeline can be accomplished with a method call like this: SimplePipeline.runPipeline(reader, ae1, ..., aeN, consumer1, ..., consumerN) uimaFIT is licensed under the Apache Software License 2.0 and is available from Google Code at: http://uimafit.googlecode.com http://code.google.com/p/uimafit/wiki/Documentation uimaFIT is available via Maven Central. If you use Maven for your build environment, then you can add uimaFIT as a dependency to your pom.xml file with the following:

<dependency>
  <groupId>org.uimafit</groupId>
  <artifactId>uimafit</artifactId>
  <version>1.0.0</version>
</dependency>

uimaFIT is a collaborative effort between the Center for Computational Pharmacology at the University of Colorado Denver, the Center for Computational Language and Education Research at the University of Colorado at Boulder, and the Ubiquitous Knowledge Processing (UKP) Lab at the Technische Universität Darmstadt. uimaFIT is extensively used by projects being developed by these groups. The uimaFIT development team is: Philip Ogren, University of Colorado, USA Richard Eckart de Castilho, Technische Universität Darmstadt, Germany Steven Bethard, Stanford University, USA with contributions from Fabio Mancinelli, Chris Roeder, Philipp Wetzler, and Torsten Zesch. Please address questions to uimafit-us...@googlegroups.com. uimaFIT requires Java 1.5 or higher and UIMA 2.3.0 or higher.
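To make the injection feature concrete, here is a minimal sketch of an annotator written against the uimaFIT 1.x API described above. GreetingAnnotator and its parameter are made up for illustration, and the factory calls mirror the ones quoted in this announcement and in the uutuc posts further down - exact signatures may vary between versions:

import org.apache.uima.UIMAFramework;
import org.apache.uima.analysis_engine.AnalysisEngine;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.jcas.JCas;
import org.apache.uima.resource.metadata.TypeSystemDescription;
import org.uimafit.component.JCasAnnotator_ImplBase;
import org.uimafit.descriptor.ConfigurationParameter;
import org.uimafit.factory.AnalysisEngineFactory;

// a made-up annotator demonstrating parameter injection via @ConfigurationParameter
public class GreetingAnnotator extends JCasAnnotator_ImplBase {

  public static final String PARAM_GREETING = "greeting";

  @ConfigurationParameter(name = PARAM_GREETING, mandatory = true)
  private String greeting;

  @Override
  public void process(JCas jCas) throws AnalysisEngineProcessException {
    // the field was populated by uimaFIT before process() is called
    System.out.println(greeting + ": " + jCas.getDocumentText());
  }

  public static void main(String[] args) throws Exception {
    // this annotator declares no custom types, so an empty type system
    // description (created via the core UIMA factory) suffices here
    TypeSystemDescription tsd =
        UIMAFramework.getResourceSpecifierFactory().createTypeSystemDescription();
    AnalysisEngine ae = AnalysisEngineFactory.createPrimitive(
        GreetingAnnotator.class, tsd, PARAM_GREETING, "hello");
    JCas jCas = ae.newJCas();
    jCas.setDocumentText("some document text");
    ae.process(jCas);
  }
}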
why use UIMA logging facility?
I am involved in a discussion about how to do logging in UIMA. We were looking at section 1.2.2 in the tutorial for some motivation as to why we would use the built-in UIMA logging rather than just using e.g. log4j directly - but there doesn't seem to be any. Could someone give us some intuition as to why it matters which logging library we use? (I have my own suspicions - but that's all they are...) Thanks, Philip
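For anyone joining the discussion, the facility in question is the logger that UIMA hands each component through its context; a minimal sketch (the annotator and message are illustrative):

import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.jcas.JCas;
import org.apache.uima.util.Level;

public class LoggingAnnotator extends JCasAnnotator_ImplBase {
  @Override
  public void process(JCas jCas) throws AnalysisEngineProcessException {
    // the framework routes this through whatever logging implementation
    // the application configured, rather than binding the component
    // to a specific library such as log4j
    getContext().getLogger().log(Level.INFO,
        "processing a document of length " + jCas.getDocumentText().length());
  }
}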
what is the difference between the base CAS view and _InitialView?
I have a component that takes text from the _InitialView view, creates a second view, and posts a modified version of the text to the second view. I had a unit test that read a JCas in from an XMI file, ran the JCas through my annotator, and tested that the text was correctly posted to the second view. Everything was fine. Later, it occurred to me that I should add SofaCapabilities to my annotator, which I did. However, my test broke. The reason seems to be that the JCas I get back from the XMI file has the text in the _InitialView, but the jCas passed into my annotator's process method is no longer the _InitialView. I tracked this down in the debugger and discovered that this is due to the following lines of code in PrimitiveAnalysisEngine_impl: // Get the right view of the CAS. Sofa-aware components get the base CAS. // Sofa-unaware components get whatever is mapped to the _InitialView. CAS view = ((CASImpl) aCAS).getBaseCAS(); if (!mSofaAware) { view = aCAS.getView(CAS.NAME_DEFAULT_SOFA); } So, to make my test work again I simply injected the following line of code near the top of the process method of my annotator: jCas = jCas.getView(CAS.NAME_DEFAULT_SOFA); This made my problem go away. However, this seems quite unsatisfactory! Can someone explain to me what the difference is between the base CAS that I am now getting in my process method and the _InitialView CAS that I was getting before? Are my annotators supposed to be aware of this difference? Any advice and/or insight on this would be appreciated. Thanks, Philip
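For reference, the annotator pattern described above looks roughly like this, with the workaround line included; the view name and text transformation are illustrative:

import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.cas.CAS;
import org.apache.uima.cas.CASException;
import org.apache.uima.jcas.JCas;

public class ViewCreatingAnnotator extends JCasAnnotator_ImplBase {
  @Override
  public void process(JCas jCas) throws AnalysisEngineProcessException {
    try {
      // workaround discussed above: make sure we are on _InitialView,
      // not the base CAS handed to sofa-aware components
      jCas = jCas.getView(CAS.NAME_DEFAULT_SOFA);
      // create a second view holding a modified copy of the text
      JCas modifiedView = jCas.createView("modifiedText");
      modifiedView.setDocumentText(jCas.getDocumentText().toLowerCase());
    } catch (CASException e) {
      throw new AnalysisEngineProcessException(e);
    }
  }
}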
Re: Dynamic Annotators
Ram, You might check out the uutuc project at: http://code.google.com/p/uutuc/ The main goal of this project is to make it easier to dynamically describe and instantiate UIMA components. The project started off as utility classes for unit testing - but has really become a dynamic descriptor framework. We need to revamp the documentation to make this clearer. As a simple example you can instantiate an analysis engine with something like: TypeSystemDescription tsd = TypeSystemDescriptionFactory.createTypeSystemDescription(MyType.class, MyType2.class, ...); AnalysisEngineFactory.createPrimitive(MyComponent.class, tsd, "myConfigurationParameterName", configParamValue, ...); It's not a complete library but it has come a long way since we released it last spring. Cheers, Philip Ram Mohan Yaratapally wrote: Hi I am currently evaluating Apache UIMA for one of our requirements. Can somebody clarify whether it is possible to load the Annotators and Feature Properties dynamically? - Ram
Re: Problem reconfiguring after setting a config parameter value
Girish, I have done exactly the same thing as you minus step 2 below without any problems. The only caveat being that this didn't seem (as I recall) to trigger my analysis engine's initialize() method and so I had to reread the parameter in my analysis engine's process() method. I don't know how helpful this kind of feedback is. Philip Chavan, Girish wrote: Hi All, I am still stuck here. My backup option is to alter the descriptor xml before building an analysis engine with it, but it is kinda hacky, so I was hoping for a cleaner solution. Can any of the developers chime in? Thanks, ___ Girish Chavan, MSIS Department of Biomedical Informatics (DBMI) University of Pittsburgh -Original Message- From: Chavan, Girish [mailto:chav...@upmc.edu] Sent: Friday, July 24, 2009 11:01 AM To: uima-user@incubator.apache.org Subject: Problem reconfiguring after setting a config parameter value Hi All, I am using the following code to set the value for the 'ChunkCreaterClass' parameter. I had to add line (2) because without it I was getting a UIMA_IllegalStateException due to the session being null. I am not quite sure if that is the way you set a new session. In any case I stopped getting that error. But now I get a NullPointerException (trace shown below the code). I had a breakpoint set in the ConfigurationManagerImplBase.validateConfigurationParameterSettings method to see what was happening. It entered the method multiple times to validate each AE's params. The first set of calls was triggered after line (1). And it went through smoothly. The next set was triggered after line (4), which threw an exception immediately for the / context. Within the implementation of the validateConfigurationParameterSettings method, I see that ConfigurationParameterDeclarations comes up null for the / context. Not quite sure what I am doing wrong. Any ideas?? CODE: (1) AnalysisEngine ae = UIMAFramework.produceAnalysisEngine(specifier); (2) ae.getUimaContextAdmin().setSession(new Session_impl()); (3) ae.setConfigParameterValue("ChunkCreaterClass", "Something"); (4) ae.reconfigure(); Exception: java.lang.NullPointerException at org.apache.uima.resource.impl.ConfigurationManagerImplBase.validateConfigurationParameterSettings(ConfigurationManagerImplBase.java:488) at org.apache.uima.resource.impl.ConfigurationManagerImplBase.reconfigure(ConfigurationManagerImplBase.java:235) at org.apache.uima.resource.ConfigurableResource_ImplBase.reconfigure(ConfigurableResource_ImplBase.java:70) Thanks. ___ Girish Chavan, MSIS Department of Biomedical Informatics (DBMI) University of Pittsburgh UPMC Cancer Pavilion, 302D 5150 Centre Avenue Pittsburgh, PA 15232 Office: 412-623-4084 Email: chav...@upmc.edu
Re: Get the path
One thing that you might consider doing is putting the path information into its own view. That is, create a new view and set its document path to be the path/URI. One advantage of this is that if you have a CollectionReader that is otherwise type system agnostic, you don't have to pollute it with a single type for holding this information. This may not be the UIMA way - but we felt for this piece of information that this was a reasonable thing to do. The following class facilitates this: http://cleartk.googlecode.com/svn/trunk/doc/api/src-html/org/cleartk/util/ViewURIUtil.html Here is our type system agnostic file system collection reader which makes use of it: http://cleartk.googlecode.com/svn/trunk/doc/api/src-html/org/cleartk/util/FilesCollectionReader.html Hope this helps. Philip Adam Lally wrote: On Tue, Jul 21, 2009 at 4:25 AM, Radwen ANIBA arad...@gmail.com wrote: Hello everyone, Well, when playing a little bit with the JCas I was wondering how to get directly the path to the document treated within the AE without expressing it directly. What I want to do is to get the path and the document name e.g. /here/in/this/folder/Document.txt Is there any extension of the arg0.getDocumentText() method or something like that? This information isn't built into the framework, but there are some examples showing how to do it. There's a type called SourceDocumentInformation that is populated by the FileSystemCollectionReader and then used in the XMI Writer CAS Consumer (among others). -Adam
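A minimal sketch of the view-based approach, assuming an illustrative view name of UriView; ClearTK's ViewURIUtil linked above does essentially this:

import org.apache.uima.cas.CAS;

public class UriViewExample {
  private static final String URI_VIEW = "UriView"; // illustrative name

  // record the source document's location in its own view;
  // the mime type is left unspecified here
  public static void setURI(CAS cas, String uri) {
    CAS view = cas.createView(URI_VIEW);
    view.setSofaDataURI(uri, null);
  }

  // recover it later from any component that knows the view name
  public static String getURI(CAS cas) {
    return cas.getView(URI_VIEW).getSofaDataURI();
  }
}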
Re: model file problen in opennlp uima wrapper
It may be worth pointing out that there is a very nice set of UIMA wrappers for OpenNLP available from their SourceForge CVS repository. See http://opennlp.cvs.sourceforge.net/opennlp/. While this is still a work in progress - it is *much* nicer than the example wrappers that ship with UIMA. Burn Lewis wrote: That probably means the OpenNLP library was compiled with a newer version of Java. Try switching to a higher version number. (The message usually indicates the version number, e.g. 6 for Java 1.6) Burn.
uutuc - uima unit test utility code
We have posted a light-weight set of utility classes that ease the burden of unit testing UIMA components. The project is located at: http://uutuc.googlecode.com/ and is licensed under ASL 2.0. There is very little documentation for this library at the moment - just a bare-bones getting started tutorial that shows a very simple example of unit testing an analysis engine (RoomNumberAnnotator). A fuller tutorial (along with javadocs) is forthcoming. However, uutuc does use itself to unit-test its code (over 90% code coverage) and is now used by ClearTK, which has more than 250 tests in it. Both of these projects provide good examples of uutuc usage. It is by no means a complete library - but we have found it very useful for our project(s). Marshall, in response to your earlier email - if it makes sense to fold this into a UIMA project like uimaj-test-util, then I think that would be great. I thought it might be easier initially to clean up what we had written for ClearTK and release it as a separate project and see if anyone finds it useful.
proposal for UIMA unit test utility code
We have assembled some miscellaneous utility methods that make unit testing easier in support of our UIMA-based project ClearTK. I have come across several scenarios now where I wish this code were available as a separate project so that I don't have to create a dependency on our entire ClearTK project when I only need a few utility classes for writing unit tests. So, I am thinking about spinning off a separate project that just has this unit-test specific code in it. There really isn't that much code there (I think it's all shoved into one class at the moment) and it would really be just a seed for getting something started. One thing the code does is simplify initialization of UIMA components programmatically such that you don't have to use descriptor files in your unit tests. A unit test might start with something like: AnalysisEngine engine = TestsUtil.getAnalysisEngine(MyAnnotator.class, TestsUtil.getTypeSystem(Document.class, Token.class, Sentence.class)); JCas jCas = engine.newJCas(); and then you can go about populating the JCas and testing its contents. The current code is located here: http://code.google.com/p/cleartk/source/browse/trunk/test/src/org/cleartk/util/TestsUtil.java Some example unit tests based on this code can be found here: http://code.google.com/p/cleartk/source/browse/trunk/test/src/org/cleartk/example/ExamplePOSHandlerTests.java Is this something that is of interest? Is there already code in UIMA that helps with this sort of thing? Thanks, Philip
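For concreteness, a complete test built on such helpers might look like the following sketch; MyAnnotator, the type classes, and the expected token count are stand-ins carried over from the snippet above rather than real ClearTK code:

import junit.framework.TestCase;
import org.apache.uima.analysis_engine.AnalysisEngine;
import org.apache.uima.cas.FSIndex;
import org.apache.uima.jcas.JCas;

public class MyAnnotatorTest extends TestCase {
  public void testMyAnnotator() throws Exception {
    // instantiate the engine without any descriptor files
    AnalysisEngine engine = TestsUtil.getAnalysisEngine(MyAnnotator.class,
        TestsUtil.getTypeSystem(Document.class, Token.class, Sentence.class));
    JCas jCas = engine.newJCas();
    jCas.setDocumentText("UIMA makes testing easy.");
    engine.process(jCas);
    // count the Token annotations the annotator created;
    // the expected value is invented for illustration
    FSIndex tokenIndex = jCas.getAnnotationIndex(Token.type);
    assertEquals(4, tokenIndex.size());
  }
}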
updating payloads
Is it possible to update the payloads of an existing index? I am having trouble finding any mention of this in the mailing list archives, and it is not obvious from the API that this is possible. I do not want to change the size of the payloads - just update the values. My payload values depend on values returned from reader.docFreq(). I don't want to update often - just once - and I would settle for a very hackish way of doing it if needed. I suppose in the worst case scenario I could update the .cfs file directly - but now I expose my naivety. Any pointers are greatly appreciated. Thanks, Philip
abandoning Groovy
Marshall, This may be frustrating/annoying feedback. Last summer I sent a few emails to the list about unit testing UIMA components using Groovy and probably said a few other positive things about Groovy. We have since abandoned Groovy for a variety of reasons. Here are a few: - unit tests on code that does anything slightly useful with parameterized types did not work correctly on mac/unix - eclipse slows down because it seems user actions are constantly waiting for groovy to compile - refactoring code is not as seamless - it adds one more moving part to e.g. download/install instructions and makes distribution incrementally more complicated - the eclipse plugin is not keeping pace with versions of groovy - so we felt stuck with groovy 1.0. We really aren't Groovy experts and so there may be reasonable counterpoints to these annoyances that we encountered. But any perceived advantage of using Groovy was outweighed by our general annoyance with this language. I hope others find these comments useful. Philip Marshall Schor wrote: I just committed a very experimental version of the Open Calais annotator to the sandbox, called OpenCalaisGroovy. As you can probably guess from the title, it uses the programming language Groovy (google groovy) - which is an extension of Java. The details about trying this out are in the Readme.txt file. At this point, the annotator doesn't run on real results from Open Calais, but rather on test data that they make available for download - about 333 news stories. There is a special collection reader that reads the files in this set and sets up a CAS with the input text and a new FeatureStructure containing what OpenCalais would have responded with. A separate annotator takes the response as specified, and constructs Entities, Relations, and Instances, following the model that Open Calais is using. This implementation uses the JCas, and there is a complete set of JCas types defined for all the Entities and Relations that Open Calais produces. This is my first try at doing things in Groovy - so it may not be done in the ideal groovy way :-) - but it's interesting code to look at. The Readme tells how to run it - you'll need the Eclipse Groovy plugin (the maven build is not yet working due to bugs in the maven groovy builder, supposedly fixed in a SNAPSHOT release, but I'm waiting for the next release candidate - this is still early in the evolution cycle). Feedback / improvement welcomed and appreciated! -Marshall
Re: Type Priorities
Katrin, Yes. There is a penalty for iterating through all the annotations of a given type. Imagine you have a token annotation and a document with 10K tokens (not uncommon). We wrote a method that doesn't have this performance penalty and bypasses the type priorities. Please see: http://cslr.colorado.edu/ClearTK/index.cgi/chrome/site/api/src-html/edu/colorado/cleartk/util/AnnotationRetrieval.html#line.237 Philip Katrin Tomanek wrote: Hi Thilo, Actually, I am using the subiterator functionality and I need to have type priorities to be set. But I don't want the user of a component to be able to alter the type priorities by modifying the descriptors where type priorities typically are set. is that necessary? The user mustn't change the type system either, else the annotator will no longer work. Wouldn't you say it's part of the contract that such metadata is not changed by the user? well, yeah. Could see it that way. But if I know at the time of writing the component which types it will use and how the priorities should look, why should I risk letting the user violate that contract? But to come back to the true problem: I am asking about type priorities because I am using the subiterator. For me, type priorities are necessary when the types I am interested in have exactly the same offsets as the type with which I constrain my subiterator. So, the question for me is: shall I use the subiterator (and define priorities in the component's descriptor) or shall I write my own function (see below) doing what I want without the type priority problems: - SNIP // gives me a list of all abbreviations which are within my entity public ArrayList<Abbreviation> getAbbreviations(Entity entity, JFSIndexRepository index) { ArrayList<Abbreviation> abbrev = new ArrayList<Abbreviation>(); int startOffset = entity.getBegin(); int endOffset = entity.getEnd(); Iterator iter = index.getAnnotationIndex(Abbreviation.type).iterator(); while (iter.hasNext()) { Abbreviation currAbbrev = (Abbreviation) iter.next(); if (currAbbrev.getBegin() >= startOffset && currAbbrev.getEnd() <= endOffset) { abbrev.add(currAbbrev); } } return abbrev; } - SNIP Do you see an efficiency disadvantage when using the above function instead of the subiterator? Katrin
Re: How to test a CollectionReader (in Groovy)
I didn't follow the thread closely so I may be wandering here - but I thought I would volunteer my working strategy for testing collection readers in Groovy, even though it may be overly simplistic for many situations. My unit tests for our collection readers start off with one line: JCas jCas = TestsUtil.processCR("desc/test/myCRdesc.xml", 0) followed immediately by assertions of what I expect to be in the JCas. The method TestsUtil.processCR looks like this: static JCas processCR(String descriptorFileName, int documentNumber) { XMLInputSource xmlInput = new XMLInputSource(new File("desc/annotators/EmptyAnnotator.xml")) ResourceSpecifier specifier = UIMAFramework.getXMLParser().parseResourceSpecifier(xmlInput) AnalysisEngine analysisEngine = UIMAFramework.produceAnalysisEngine(specifier) JCas jCas = analysisEngine.newJCas() xmlInput = new XMLInputSource(new File(descriptorFileName)) specifier = UIMAFramework.getXMLParser().parseResourceSpecifier(xmlInput) CollectionReader collectionReader = UIMAFramework.produceCollectionReader(specifier) for(i in 0..documentNumber) { jCas.reset() collectionReader.getNext(jCas.getCas()) } return jCas } Where EmptyAnnotator.xml is a descriptor file for an analysis engine that does nothing, as follows: public class EmptyAnnotator extends JCasAnnotator_ImplBase{ public void process(JCas jCas) throws AnalysisEngineProcessException{ //this annotator does nothing! } } I hope this is helpful.
Re: ClassCastException thrown when using subiterator and moveTo()
btw, the work around I had posted is *much* slower than the fixed subiterator method - for both creating the iterator and iterating through it. More than an order of magnitude slower (roughly 15x slower). Thanks for the fix! Thilo Goetz wrote: public static FSIterator getWindowIterator(JCas jCas, Annotation windowAnnotation, Type type) { ConstraintFactory constraintFactory = jCas.getConstraintFactory(); FeaturePath beginFeaturePath = jCas.createFeaturePath(); beginFeaturePath.addFeature(type.getFeatureByBaseName("begin")); FSIntConstraint intConstraint = constraintFactory.createIntConstraint(); intConstraint.geq(windowAnnotation.getBegin()); FSMatchConstraint beginConstraint = constraintFactory.embedConstraint(beginFeaturePath, intConstraint); FeaturePath endFeaturePath = jCas.createFeaturePath(); endFeaturePath.addFeature(type.getFeatureByBaseName("end")); intConstraint = constraintFactory.createIntConstraint(); intConstraint.leq(windowAnnotation.getEnd()); FSMatchConstraint endConstraint = constraintFactory.embedConstraint(endFeaturePath, intConstraint); FSMatchConstraint windowConstraint = constraintFactory.and(beginConstraint, endConstraint); FSIndex windowIndex = jCas.getAnnotationIndex(type); FSIterator windowIterator = jCas.createFilteredIterator(windowIndex.iterator(), windowConstraint); return windowIterator; } Thilo Goetz wrote: That's a bug. The underlying implementation of the two iterator types you mention is totally different, hence you see this only in one of them. Any chance you could provide a self-contained test case that exhibits this? --Thilo Philip Ogren wrote: I am having difficulty with using the FSIterator returned by the AnnotationIndex.subiterator(AnnotationFS) method. The following is a code fragment: AnnotationIndex annotationIndex = jCas.getAnnotationIndex(tokenType); FSIterator tokenIterator = annotationIndex.subiterator(sentenceAnnotation); tokenIterator.moveTo(tokenAnnotation); Here is the relevant portion of the stack trace: java.lang.ClassCastException: edu.colorado.cslr.dessert.types.Token at java.util.Collections.indexedBinarySearch(Unknown Source) at java.util.Collections.binarySearch(Unknown Source) at org.apache.uima.cas.impl.Subiterator.moveTo(Subiterator.java:224) If I change the second line to the following, then I do not have any problems with an exception being thrown. FSIterator tokenIterator = annotationIndex.iterator(); Is this a bug or some misunderstanding on my part of how subiterator should work? Thanks, Philip
Unit testing with Groovy
Great! I have found that a great way to get started with Groovy is for unit testing. I borrowed a few lines of Java code for obtaining a JCas that I saw in an earlier post (from Marshall I think) that I used to create a little util class that has a single method: static JCas process(String descriptorFileName, String textFileName) { File textFile = new File(textFileName) String text if(textFile.exists()) text = textFile.text else text = textFileName XMLInputSource xmlInput = new XMLInputSource(new File(descriptorFileName)) ResourceSpecifier specifier = UIMAFramework.getXMLParser().parseResourceSpecifier(xmlInput) AnalysisEngine analysisEngine = UIMAFramework.produceAnalysisEngine(specifier) JCas jCas = analysisEngine.newJCas() jCas.setDocumentText(text) analysisEngine.process(jCas) return jCas } With this I can make a unit test very easily: void testTokenizer() { println "running token tests ..." JCas jCas = TestsUtil.process(tokenDescriptorFile, "data/test/docs/sampletext.txt") FSIndex tokenIndex = jCas.getAnnotationIndex(Token.type) assert tokenIndex.size() == 132 ... } I have also started exploring writing UIMA components with Groovy. I recently put together a file collection reader that uses Groovy's Ant builder, which allows one to define includes and excludes for a FileSet in a descriptor file the same way you would in a build.xml file. I haven't made it available yet because it needs a README and it is also dog slow on large corpora. However, it is 100 lines of code and provides nifty functionality for smaller data sets. My point is simply to give an example of what is possible. When I understand the CAS better, it might be cool to think about creating a Groovy builder that would be a more dynamic solution than JCas - but that is just a dream at the moment given my time and (lack of) expertise in these two technologies. Thilo Goetz wrote: Fixed for 2.2, thanks for reporting and providing the test case. This Groovy stuff looks pretty cool, I'll have to check it out some more... --Thilo Thilo Goetz wrote: The easiest place to create a test case is on Jira. Open a Jira bug for UIMA, and you can attach the test case. When you do that, please check the box that says something like, ok to include in Apache code (so we can check it in and use it as a regression test). Groovy, hm. Never used it before. If it doesn't take me more than 5 min to set up in Eclipse, and I can still debug, not a problem ;-) --Thilo Philip Ogren wrote: Yes. I will throw one together. Would you mind if I used Groovy for this - or is that going to be annoying? Let me know. Also, from an earlier email I saw on this list it seems that attachments are a problem. Is there some place where I could directly load a test case?
In the mean time, here is a work around I just put together (I'm still unit testing the code that uses this - so I'm not certain this is bug free): public static FSIterator getWindowIterator(JCas jCas, Annotation windowAnnotation, Type type) { ConstraintFactory constraintFactory = jCas.getConstraintFactory(); FeaturePath beginFeaturePath = jCas.createFeaturePath(); beginFeaturePath.addFeature(type.getFeatureByBaseName("begin")); FSIntConstraint intConstraint = constraintFactory.createIntConstraint(); intConstraint.geq(windowAnnotation.getBegin()); FSMatchConstraint beginConstraint = constraintFactory.embedConstraint(beginFeaturePath, intConstraint); FeaturePath endFeaturePath = jCas.createFeaturePath(); endFeaturePath.addFeature(type.getFeatureByBaseName("end")); intConstraint = constraintFactory.createIntConstraint(); intConstraint.leq(windowAnnotation.getEnd()); FSMatchConstraint endConstraint = constraintFactory.embedConstraint(endFeaturePath, intConstraint); FSMatchConstraint windowConstraint = constraintFactory.and(beginConstraint, endConstraint); FSIndex windowIndex = jCas.getAnnotationIndex(type); FSIterator windowIterator = jCas.createFilteredIterator(windowIndex.iterator(), windowConstraint); return windowIterator; } Thilo Goetz wrote: That's a bug. The underlying implementation of the two iterator types you mention is totally different, hence you see this only in one of them. Any chance you could provide a self-contained test case that exhibits this? --Thilo Philip Ogren wrote: I am having difficulty with using the FSIterator returned by the AnnotationIndex.subiterator(AnnotationFS) method. The following is a code fragment: AnnotationIndex annotationIndex = jCas.getAnnotationIndex(tokenType); FSIterator tokenIterator = annotationIndex.subiterator(sentenceAnnotation); tokenIterator.moveTo(tokenAnnotation); Here is the relevant portion of the stack trace: java.lang.ClassCastException: edu.colorado.cslr.dessert.types.Token at java.util.Collections.indexedBinarySearch(Unknown Source) at java.util.Collections.binarySearch(Unknown Source) at org.apache.uima.cas.impl.Subiterator.moveTo(Subiterator.java:224)
typeSystemInit() method in CasAnnotator_ImplBase but not JCasAnnotator_ImplBase
Thilo had pointed me towards the method typeSystemInit() in a recent posting as a way of getting type system information in an annotator. Is there a reason that this method exists in CasAnnotator_ImplBase but not JCasAnnotator_ImplBase? Or is this an omission? My intuition is that it might have been left out on purpose because when using the JCas you typically have a fixed type system that is determined at compile time. Still, it seems useful to be able to override typeSystemInit() even if it is only called once. I didn't really think carefully about which Annotator_ImplBase I should use. Are there far-reaching consequences I should consider - or is it really just about convenience at the API level? Thanks, Philip
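For readers who haven't used it, the CAS-based hook looks like this; a minimal sketch, assuming an illustrative type name org.example.Token:

import org.apache.uima.analysis_component.CasAnnotator_ImplBase;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.cas.CAS;
import org.apache.uima.cas.Type;
import org.apache.uima.cas.TypeSystem;

public class TokenCountAnnotator extends CasAnnotator_ImplBase {
  private Type tokenType;

  @Override
  public void typeSystemInit(TypeSystem typeSystem) throws AnalysisEngineProcessException {
    // called by the framework when the type system changes, before process();
    // resolve type handles once here instead of on every CAS
    tokenType = typeSystem.getType("org.example.Token");
  }

  @Override
  public void process(CAS cas) throws AnalysisEngineProcessException {
    int n = cas.getAnnotationIndex(tokenType).size();
    System.out.println("tokens: " + n);
  }
}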
[jira] Updated: (UIMA-464) ClassCastException thrown when using subiterator and moveTo()
[ https://issues.apache.org/jira/browse/UIMA-464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Philip Ogren updated UIMA-464: -- Attachment: UIMA-464.zip Please see README in the top level of the directory. If there is something I could have done to make this fit better into your regression test suite, then please let me know. ClassCastException thrown when using subiterator and moveTo() - Key: UIMA-464 URL: https://issues.apache.org/jira/browse/UIMA-464 Project: UIMA Issue Type: Bug Components: Core Java Framework Affects Versions: 2.1 Environment: N/A Reporter: Philip Ogren Attachments: UIMA-464.zip I am having difficulty with using the FSIterator returned by the AnnotationIndex.subiterator(AnnotationFS) method. The following is a code fragment: AnnotationIndex annotationIndex = jCas.getAnnotationIndex(tokenType); FSIterator tokenIterator = annotationIndex.subiterator(sentenceAnnotation); tokenIterator.moveTo(tokenAnnotation); Here is the relevant portion of the stack trace: java.lang.ClassCastException: edu.colorado.cslr.dessert.types.Token at java.util.Collections.indexedBinarySearch(Unknown Source) at java.util.Collections.binarySearch(Unknown Source) at org.apache.uima.cas.impl.Subiterator.moveTo(Subiterator.java:224) If I change the second line to the following, then I do not have any problems with an exception being thrown. FSIterator tokenIterator = annotationIndex.iterator(); -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: subtyping uima.cas.String
Attached is a type system descriptor file that isolates the bug. I cannot create a subtype of TestType in the CDE. Philip Ogren wrote: I was just putting some unit tests together and was editing a type system and noticed that I can't seem to subtype a type that is a subtype of uima.cas.String in the CDE. I thought this was maybe a bug that I should bring attention to. For my part, I'm not concerned with the ability to create nested subtypes of uima.cas.String - so apologies if this is noise.

<?xml version="1.0" encoding="UTF-8"?>
<typeSystemDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <name>TestTypeSystem</name>
  <description/>
  <version>1.0</version>
  <vendor/>
  <types>
    <typeDescription>
      <name>TestType</name>
      <description/>
      <supertypeName>uima.cas.String</supertypeName>
    </typeDescription>
    <typeDescription>
      <name>TestType2</name>
      <description/>
      <supertypeName>uima.tcas.Annotation</supertypeName>
    </typeDescription>
    <typeDescription>
      <name>TestType3</name>
      <description/>
      <supertypeName>TestType2</supertypeName>
    </typeDescription>
  </types>
</typeSystemDescription>
Re: subtyping uima.cas.String
Sorry for the noise! A little investigation reveals that this behavior is almost certainly by design. Changing the source by hand gives an error message that says don't do that, and section 2.3.4 of the UIMA References also documents this. Philip Ogren wrote: Attached is a type system descriptor file that isolates the bug. I cannot create a subtype of TestType in the CDE. Philip Ogren wrote: I was just putting some unit tests together and was editing a type system and noticed that I can't seem to subtype a type that is a subtype of uima.cas.String in the CDE. I thought this was maybe a bug that I should bring attention to. For my part, I'm not concerned with the ability to create nested subtypes of uima.cas.String - so apologies if this is noise.

<?xml version="1.0" encoding="UTF-8"?>
<typeSystemDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <name>TestTypeSystem</name>
  <description/>
  <version>1.0</version>
  <vendor/>
  <types>
    <typeDescription>
      <name>TestType</name>
      <description/>
      <supertypeName>uima.cas.String</supertypeName>
    </typeDescription>
    <typeDescription>
      <name>TestType2</name>
      <description/>
      <supertypeName>uima.tcas.Annotation</supertypeName>
    </typeDescription>
    <typeDescription>
      <name>TestType3</name>
      <description/>
      <supertypeName>TestType2</supertypeName>
    </typeDescription>
  </types>
</typeSystemDescription>
Re: Human annotation tool for UIMA
Here is a link to the svn repository which has a branch for the CasEditor. I have not yet found a documentation page for it, if there is one. http://svn.apache.org/repos/asf/incubator/uima/sandbox/trunk/ Andrew Borthwick wrote: As per the below, could someone please post a link to the CasEditor to uima-user so we don't all have to go hunting for it? Thanks, Andrew Borthwick On 6/7/07, Philip Ogren [EMAIL PROTECTED] wrote: Also note that we have a contribution by Joern Kottmann in the sandbox called CAS editor. This is Eclipse-based tooling that also allows you to manually create annotations. It is still under development, but maybe there could be some cross-fertilization. The CAS editor is very hard to find! You might consider putting a link on the sandbox page. I will take a look at this. If nothing else, Knowtator might serve as a source of requirements / feature requests for the CasEditor. Also, calculating IAA is non-trivial for a variety of reasons which deserve their own conversation. Do you have any notion what Joern's commitment is to the project?
problem creating Index Collection Descriptor File
I have three related questions that I decided to split up into three messages. I composed them as one email initially and decided I could be spawning a hard-to-traverse thread. Apologies in advance for the inundation. I am trying to create an Index Collection Descriptor File so that I can have my index definitions in one place and am unable to do so. A dialog comes up requesting a type system descriptor file as a context for the descriptor file. When I specify my type system descriptor file and click ok/finish, the component descriptor editor gives me an error message and a details button. When I click on the details button I get: java.lang.ClassCastException: java.io.File at org.apache.uima.taeconfigurator.editors.MultiPageEditor.loadContext(MultiPageEditor.java:1479) at org.apache.uima.taeconfigurator.editors.MultiPageEditor.setFsIndexCollection(MultiPageEditor.java:1505) ... at org.eclipse.core.launcher.Main.basicRun(Main.java:280) at org.eclipse.core.launcher.Main.run(Main.java:977) at org.eclipse.core.launcher.Main.main(Main.java:952) Am I doing something wrong or is this functionality broken? Thanks, Philip
finding annotations relative to other annotations
Is there any simple way to ask for the token 3 to the left of my current token? I can't find anything that is built into the default annotation index, and so I have defined an index for this in the descriptor file. In order to do this I define a feature in my token type that keeps track of the index of each token and create an FSIndex on that feature. I can then use the FSIndex as follows: FSIndex fsIndex = jCas.getFSIndexRepository().getIndex("my.token.place.index"); MyToken targetAnnotation = new MyToken(jCas); targetAnnotation.setIndex(targetIndex); // targetIndex is the index of the token I want to find return fsIndex.find(targetAnnotation); Does this make sense? It works - but I just want to make sure that I am not creating extra work for myself. I actually have several types that I want to be able to access this way and so I've created a super-type that has the 'index' feature. This allows me to create a set of utility methods that I can use for retrieving e.g. tokens, sentences, constituents, etc. at a specific index or relative to other annotations of the same type. The only challenge is making sure I retrieve (and manage) the appropriate index when one of these methods is called. Any advice here is appreciated. Thanks, Philip
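For anyone wanting to replicate this, the index definition in the descriptor would look something like the following; a sketch assuming an illustrative type my.TokenType with an integer feature named index:

<fsIndexCollection>
  <fsIndexes>
    <fsIndexDescription>
      <label>my.token.place.index</label>
      <typeName>my.TokenType</typeName>
      <kind>sorted</kind>
      <keys>
        <fsIndexKey>
          <!-- sort on the integer 'index' feature so find() can locate
               a token by its position -->
          <featureName>index</featureName>
          <comparator>standard</comparator>
        </fsIndexKey>
      </keys>
    </fsIndexDescription>
  </fsIndexes>
</fsIndexCollection>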
Re: Human annotation tool for UIMA
My initial thought was to have a CasConsumer that loads annotations directly into Knowtator programmatically, and a CasInitializer that goes the other way. What remains is to have a way to translate/synchronize the Type System in UIMA with the class hierarchy / annotation schema in Knowtator back and forth. Both of these tasks should be fairly straightforward. I would rather do it this way than muck with translating file formats. Making it possible to go from one to the other is one thing but making it fun and easy is another. Some considerations are: - Knowtator has extra book keeping information that it associates with each annotation such as the annotator, the creation date, and comments - Knowtator allows multiple spans for a single annotation - Knowtator allows you to annotate mentions of classes (e.g. person) or mentions of instances (George Washington) ... and there are always usability issues, etc. see other answers interspersed below. Thilo Goetz wrote: Hi Philip, I downloaded Knowtator and played around with it a bit. It looks pretty slick (although I think it could be improved by following some of the more standard UI conventions, like dialogs with ok and cancel buttons, for example; though I guess that's Protege, not Knowtator). Yes. Protege uses non-modal dialogs by design. There is a lengthy explanation of this on the Protege website: http://protege.stanford.edu/doc/design/ok_and_cancel_buttons.html Also note that we have a contribution by Joern Kottmann in the sandbox called CAS editor. This is Eclipse based tooling that also allows you to manually create annotations. It is still under development, but maybe there could be some cross-fertilization. The CAS editor is very hard to find! You might consider putting a link on the sandbox page. I will take a look at this. If nothing else, Knowtator might serve as a source of requirements / feature requests for the CasEditor. Also, calculating IAA is non-trivial for a variety of reasons which deserve their own conversation. Do you have any notion what Joern's commitment is to the project? [I know it's a bit early to talk about legal issues, but please note that the Mozilla license is incompatible with the Apache license. If any of your code were to move to Apache, it would need to be relicensed under ASL 2.0.] We chose MPL because that is what Protege uses. Re funding: we are optimistic that there will be more UIMA innovation awards by IBM this year. Watch this space for the announcement. I would think that a Knowtator/UIMA integration could be a candidate. Great! We will look forward to seeing the announcement. A human annotation tool that works well with UIMA would be an important addition to our ecosystem, so I'm glad this discussion is happening. BTW, are people familiar with the annotation standards work that is going on at ISO (http://www.tc37sc4.org/)? This bears watching as well, as it might evolve into the annotation standard that everybody's been waiting for. At least that seems to be the plan of the folks working on it ;-). --Thilo
Re: Human annotation tool for UIMA
I'm glad I happened to browse the archive today! I just joined the list today because I have noticed a couple of bugs that I want to post somewhere. So, I developed and maintain Knowtator and am also steeped in UIMA technology - I have been using it for just over a year and a half now. I would love to make Knowtator tightly integrated with UIMA. The frame-based representation that Knowtator uses via Protégé (Frames edition) is *very* similar to the type system framework in UIMA. Protégé frames is actually much more expressive, and I would be surprised if its representational capability does not completely subsume UIMA's type system representational capability. Knowtator cannot make use of some of the more advanced representational constructs such as slot inheritance (i.e. superslots and subslots, e.g. has-characteristic might subsume has-color) or multiple inheritance - but I think it handles anything that can be represented in a UIMA type system. I think it would be a really nice fit. A one-off solution as described previously is going to be extremely frustrating. Things that you might want to do with an annotation tool include: - calculate inter-annotator agreement in a wide variety of ways - consolidate/adjudicate disagreement between annotators - mundane data management tasks - keep track of who annotated what - merge sets of annotations together - stand-alone annotation on a laptop (a perk for annotators if it is a part-time job that they can do during hours of their own choosing) - work on / visualize subsets of annotations - there are a bajillion user-interface considerations I'm sure there are many other things I haven't thought of off the top of my head. We have been working really hard on Knowtator the last few months and have tried to make it much easier to get started with and more user friendly. If you have looked at it previously and got discouraged, then I encourage you to take a look at the latest version and the updated documentation. It is still a bit clunky with respect to importing and exporting annotation data (which we are currently working on to make easier) - but having a UIMA solution would make this problem go away for this community. We have created some one-off scripts to go from one to the other that we could possibly make available with a little effort if there is interest. If you have ideas about how this effort could be funded I would be grateful for suggestions. We are considering applying for an Eclipse Innovation Award as an appropriate venue but we don't really know what the odds are of getting it funded for this work. Or, if you have interest in working on this yourself, I would be thrilled to provide expertise. Thanks, Philip One manual annotation tool that is open source is Knowtator (which is licensed under MPL 1.1). As I understand it, Knowtator is intended for manual annotation of entities and relationships in text. It is a layer on top of the Protégé open source ontology editor. I'm not really familiar enough with Knowtator to explicitly recommend it. Considering its stated goals and the framework that it was developed on, it seems like it might be particularly well suited to enabling manual annotations for relatively elaborate type systems that have a lot of structure and many common relation annotation types. The flip side is that it may be overkill for the (more common) task of marking up instances of a flat list of named-entity types.
In any event, my point here is just that anyone who is thinking of building a mapping from an open source manual annotation tool to UIMA may want to consider Knowtator, especially if they are interested in a lot of expressive power.
creating a type called 'Feature'
If you create a type with the name 'Feature' you get compile errors in the generated code because of a namespace conflict with the Feature interface. I think this could be easily fixed by simply removing the import statement in the generated code and explicitly providing the fully qualified name for the Feature interface wherever it is referenced (e.g. 'org.apache.uima.cas.Feature'). Please advise me if I am submitting this to the wrong list. Thanks, Philip
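To illustrate the clash: the generated class is itself named Feature, so an import of org.apache.uima.cas.Feature collides with it. A minimal hand-written sketch of the fully-qualified fix suggested above (this is illustrative, not actual JCasGen output):

package org.example.types;

// note: no "import org.apache.uima.cas.Feature;" here - this class is
// itself named Feature, so the UIMA interface must be referenced by its
// fully qualified name instead
public class Feature extends org.apache.uima.jcas.tcas.Annotation {

  public Feature(org.apache.uima.jcas.JCas jcas) {
    super(jcas);
  }

  // illustrative use of the framework interface, fully qualified
  public String describe(org.apache.uima.cas.Feature uimaFeature) {
    return uimaFeature.getName();
  }
}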