Spark assembly for YARN/CDH5
Does anyone know if there are Spark assemblies created and available for download that have been built for CDH5 and YARN? Thanks, Philip
Re: creating a distributed index
After playing around with mapPartitions I think this does exactly what I want. I can pass in a function to mapPartitions that looks like this:

  def f1(iter: Iterator[String]): Iterator[MyIndex] = {
    val idx: MyIndex = new MyIndex()
    while (iter.hasNext) {
      val text: String = iter.next
      idx.add(text)
    }
    List(idx).iterator
  }

  val indexRDD: RDD[MyIndex] = rdd.mapPartitions(f1, true)

Now I can perform a match operation like so:

  val matchRdd: RDD[Match] = indexRDD.flatMap(index => index.`match`(myquery))

I'm sure it won't take much imagination to figure out how to do the matching in a batch way. If anyone has done anything along these lines I'd love to have some feedback. Thanks, Philip

On 08/04/2014 09:46 AM, Philip Ogren wrote: This looks like a really cool feature and it seems likely that this will be extremely useful for things we are doing. However, I'm not sure it is quite what I need here. With an inverted index you don't actually look items up by their keys but instead try to match against some input string. So, if I created an inverted index on 1M strings/documents in a single JVM I could subsequently submit a string-valued query to retrieve the n-best matches. I'd like to do something like build ten 1M string/document indexes across ten nodes, submit a string-valued query to each of the ten indexes, and aggregate the n-best matches from the ten sets of results. Would this be possible with IndexedRDD or some other feature of Spark? Thanks, Philip

On 08/01/2014 04:26 PM, Ankur Dave wrote: At 2014-08-01 14:50:22 -0600, Philip Ogren philip.og...@oracle.com wrote: It seems that I could do this with mapPartitions so that each element in a partition gets added to an index for that partition. [...] Would it then be possible to take a string and query each partition's index with it? Or better yet, take a batch of strings and query each string in the batch against each partition's index? I proposed a key-value store based on RDDs called IndexedRDD that does exactly what you described. It uses mapPartitions to construct an index within each partition, then exposes get and multiget methods to allow looking up values associated with given keys. It will hopefully make it into Spark 1.2.0. Until then you can try it out by merging in the pull request locally: https://github.com/apache/spark/pull/1297. See JIRA for details and slides on how it works: https://issues.apache.org/jira/browse/SPARK-2365. Ankur
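A sketch of what the batch version might look like (MyIndex and Match are the hypothetical types from the message above; this additionally assumes Match records the query string it matched and a numeric score, which the original code doesn't show):

  import org.apache.spark.SparkContext._  // pair-RDD implicits
  import org.apache.spark.rdd.RDD

  // Broadcast the whole query batch, match every query against every
  // partition's index, then keep the n best matches per query.
  def batchMatch(indexRDD: RDD[MyIndex], queries: Seq[String], n: Int): Map[String, Seq[Match]] = {
    val broadcastQueries = indexRDD.context.broadcast(queries)
    indexRDD
      .flatMap(index => broadcastQueries.value.flatMap(q => index.`match`(q)))
      .groupBy(_.query)                              // regroup matches by query string
      .mapValues(_.toSeq.sortBy(-_.score).take(n))   // n-best across all partition indexes
      .collect()
      .toMap
  }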
relationship of RDD[Array[String]] to Array[Array[String]]
It is really nice that Spark RDDs provide functions that are often equivalent to functions found in Scala collections. For example, I can call: myArray.map(myFx) and equivalently myRdd.map(myFx) Awesome! My question is this: is it possible to write code that works on either an RDD or a local collection without having to have parallel implementations? I can't tell that RDD or Array share any supertypes or traits by looking at the respective scaladocs. Perhaps implicit conversions could be used here. What I would like to do is have a single function whose body is like this: myData.map(myFx) where myData could be an RDD[Array[String]] (for example) or an Array[Array[String]]. Has anyone had success doing this? Thanks, Philip
Re: relationship of RDD[Array[String]] to Array[Array[String]]
Thanks Michael, That is one solution that I had thought of. It seems like a bit of overkill for the few methods I want to do this for - but I will think about it. I guess I was hoping that I was missing something more obvious/easier. Philip On 07/21/2014 11:20 AM, Michael Malak wrote: It's really more of a Scala question than a Spark question, but the standard OO (not Scala-specific) way is to create your own custom supertype (e.g. MyCollectionTrait), inherited/implemented by two concrete classes (e.g. MyRDD and MyArray), each of which manually forwards method calls to the corresponding pre-existing library implementations. Writing all those forwarding method calls is tedious, but Scala provides at least one bit of syntactic sugar, which alleviates having to type in twice the parameter lists for each method: http://stackoverflow.com/questions/8230831/is-method-parameter-forwarding-possible-in-scala I'm not seeing a way to utilize implicit conversions in this case. Since Scala is statically (albeit inferred) typed, I don't see a way around having a common supertype. On Monday, July 21, 2014 11:01 AM, Philip Ogren philip.og...@oracle.com wrote: It is really nice that Spark RDDs provide functions that are often equivalent to functions found in Scala collections. For example, I can call: myArray.map(myFx) and equivalently myRdd.map(myFx) Awesome! My question is this: is it possible to write code that works on either an RDD or a local collection without having to have parallel implementations? I can't tell that RDD or Array share any supertypes or traits by looking at the respective scaladocs. Perhaps implicit conversions could be used here. What I would like to do is have a single function whose body is like this: myData.map(myFx) where myData could be an RDD[Array[String]] (for example) or an Array[Array[String]]. Has anyone had success doing this? Thanks, Philip
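To make the forwarding approach concrete, here is a minimal sketch; the names are invented for illustration, only map is forwarded (real code would repeat the pattern for each method needed), and it compiles against the Spark 1.x API, where map requires a ClassTag:

  import scala.reflect.ClassTag
  import org.apache.spark.rdd.RDD

  // A hypothetical common supertype for RDDs and local arrays.
  trait MyCollection[T] {
    def map[U: ClassTag](f: T => U): MyCollection[U]
  }

  class MyRDD[T](rdd: RDD[T]) extends MyCollection[T] {
    def map[U: ClassTag](f: T => U): MyCollection[U] = new MyRDD(rdd.map(f))
  }

  class MyArray[T](array: Array[T]) extends MyCollection[T] {
    def map[U: ClassTag](f: T => U): MyCollection[U] = new MyArray(array.map(f))
  }

A function like def process(data: MyCollection[Array[String]]) = data.map(_.length) then works on either backing collection.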
Re: Announcing Spark 1.0.1
Hi Patrick, This is great news, but I nearly missed the announcement because it had scrolled off the folder view that I have Spark users list messages go to - 40+ new threads since you sent the email out on Friday evening. You might consider having someone on your team create a spark-announcement list so that it is easier to disseminate important information like this release announcement. Thanks again for all your hard work. I know you and the rest of the team are getting a million requests a day. Philip On 07/11/2014 07:35 PM, Patrick Wendell wrote: I am happy to announce the availability of Spark 1.0.1! This release includes contributions from 70 developers. Spark 1.0.1 includes fixes across several areas of Spark, including the core API, PySpark, and MLlib. It also includes new features in Spark's (alpha) SQL library, including support for JSON data and performance and stability fixes. Visit the release notes [1] to read about this release or download [2] the release today. [1] http://spark.apache.org/releases/spark-release-1-0-1.html [2] http://spark.apache.org/downloads.html
Multiple SparkContexts with different configurations in same JVM
In various previous versions of Spark (and I believe the current version, 1.0.0, as well) we have noticed that it does not seem possible to have, in the same JVM, both a local SparkContext and a SparkContext connected to a cluster, whether a standalone Spark cluster (i.e. using Spark's own resource manager) or a YARN cluster. Is this a known issue? If not, then I would be happy to write up a bug report on what the bad/unexpected behavior is and how to reproduce it. If so, then are there plans to fix this? Perhaps there is a Jira issue I could be pointed to or other related discussion. Thanks, Philip
Re: Unit test failure: Address already in use
In my unit tests I have a base class that all my tests extend that has a setup and teardown method that they inherit. They look something like this:

  var spark: SparkContext = _

  @Before
  def setUp() {
    Thread.sleep(100L) // this seems to give spark more time to reset from the previous test's tearDown
    spark = new SparkContext("local", "test spark")
  }

  @After
  def tearDown() {
    spark.stop
    spark = null // not sure why this helps but it does!
    System.clearProperty("spark.master.port")
  }

It's been since last fall (i.e. version 0.8.x) since I've examined this code, so I can't vouch that it is still accurate/necessary - but it still works for me.

On 06/18/2014 12:59 PM, Lisonbee, Todd wrote: Disabling parallelExecution has worked for me. Other alternatives I've tried that also work include:

1. Using a lock - this will let tests execute in parallel except for those using a SparkContext. If you have a large number of tests that could execute in parallel, this can shave off some time.

  object TestingSparkContext {
    val lock = new Lock()
  }

  // before you instantiate your local SparkContext
  TestingSparkContext.lock.acquire()
  // after you call sc.stop()
  TestingSparkContext.lock.release()

2. Sharing a local SparkContext between tests. - This is nice because your tests will run faster. Start-up and shutdown is time consuming (can add a few seconds per test). - The downside is that your tests are using the same SparkContext so they are less independent of each other. I haven't seen issues with this yet but there are likely some things that might crop up. Best, Todd

*From:* Anselme Vignon [mailto:anselme.vig...@flaminem.com] *Sent:* Wednesday, June 18, 2014 12:33 AM *To:* user@spark.apache.org *Subject:* Re: Unit test failure: Address already in use

Hi, Could your problem come from the fact that you run your tests in parallel? If you are running Spark in local mode, you cannot have concurrent Spark instances running. This means that your tests instantiating a SparkContext cannot be run in parallel. The easiest fix is to tell sbt to not run parallel tests. This can be done by adding the following line in your build.sbt: parallelExecution in Test := false Cheers, Anselme

2014-06-17 23:01 GMT+02:00 SK skrishna...@gmail.com: Hi, I have 3 unit tests (independent of each other) in the /src/test/scala folder. When I run each of them individually using sbt test-only, all 3 pass the test. But when I run them all using sbt test, then they fail with the warning below. I am wondering if the binding exception results in failure to run the job, thereby causing the failure. If so, what can I do to address this binding exception? I am running these tests locally on a standalone machine (i.e. SparkContext("local", "test")).

14/06/17 13:42:48 WARN component.AbstractLifeCycle: FAILED org.eclipse.jetty.server.Server@3487b78d: java.net.BindException: Address already in use
java.net.BindException: Address already in use
at sun.nio.ch.Net.bind0(Native Method)
at sun.nio.ch.Net.bind(Net.java:174)
at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:139)
at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:77)

thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Unit-test-failure-Address-already-in-use-tp7771.html
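For reference, a minimal sketch of Todd's option 2 (sharing one context; the names are invented, and it assumes no test stops the shared context):

  import org.apache.spark.SparkContext

  // One SparkContext for the whole test run, stopped at JVM exit
  // instead of per test.
  object SharedSparkContext {
    lazy val sc: SparkContext = {
      val context = new SparkContext("local", "shared test context")
      sys.addShutdownHook(context.stop())
      context
    }
  }

Tests then refer to SharedSparkContext.sc instead of creating their own context.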
Re: Processing audio/video/images
I asked a question related to Marcelo's answer a few months ago. The discussion there may be useful: http://apache-spark-user-list.1001560.n3.nabble.com/RDD-URI-td1054.html On 06/02/2014 06:09 PM, Marcelo Vanzin wrote: Hi Jamal, If what you want is to process lots of files in parallel, the best approach is probably to load all file names into an array and parallelize that. Then each task will take a path as input and can process it however it wants. Or you could write the file list to a file, and then use sc.textFile() to open it (assuming one path per line), and the rest is pretty much the same as above. It will probably be hard to process each individual file in parallel, unless mp3 and jpg files can be split into multiple blocks that can be processed separately. In that case, you'd need a custom (Hadoop) input format that is able to calculate the splits. But it doesn't sound like that's what you want. On Mon, Jun 2, 2014 at 5:02 PM, jamal sasha jamalsha...@gmail.com wrote: Hi, How does one process data sources other than text? Let's say I have millions of mp3 (or jpeg) files and I want to use spark to process them. How does one go about it? I have never been able to figure this out. Let's say I have this library in python which works like the following:

  import audio
  song = audio.read_mp3(filename)

Then most of the methods are attached to song, or maybe there is another function which takes the song type as an input. Maybe the above is just rambling... but how do I use spark to process (say) audio files? Thanks
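Marcelo's first suggestion might look roughly like this in Scala (a sketch; the paths must be readable from every worker, e.g. on a shared mount, and the byte-count map is a stand-in for real audio/image processing):

  val paths = Seq("/data/a.mp3", "/data/b.mp3")    // hypothetical file listing
  val results = sc.parallelize(paths, paths.size)  // one partition per file
    .map { path =>
      val bytes = java.nio.file.Files.readAllBytes(java.nio.file.Paths.get(path))
      (path, bytes.length)  // replace with whatever per-file work you need
    }
    .collect()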
Re: Use SparkListener to get overall progress of an action
Hi Pierre, I asked a similar question on this list about 6 weeks ago. Here is one answer I got that is of particular note: http://mail-archives.apache.org/mod_mbox/spark-user/201404.mbox/%3ccamjob8n3foaxd-dc5j57-n1oocwxefcg5chljwnut7qnreq...@mail.gmail.com%3E In the upcoming release of Spark 1.0 there will be a feature that provides for exactly what you describe: capturing the information displayed on the UI in JSON. More details will be provided in the documentation, but for now, anything before 0.9.1 can only go through JobLogger.scala, which outputs information in a somewhat arbitrary format and will be deprecated soon. If you find this feature useful, you can test it out by building the master branch of Spark yourself, following the instructions in https://github.com/apache/spark/pull/42. On 05/22/2014 08:51 AM, Pierre B wrote: Is there a simple way to monitor the overall progress of an action using SparkListener or anything else? I see that one can name an RDD... Could that be used to determine which action triggered a stage? Thanks Pierre -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Use-SparkListener-to-get-overall-progress-of-an-action-tp6256.html
Re: Spark unit testing best practices
Have you actually found this to be true? I have found Spark local mode to be quite good about blowing up if there is something non-serializable, and so my unit tests have been great for detecting this. I have never seen something that worked in local mode that didn't work on the cluster because of different serialization requirements between the two. Perhaps it is different when using Kryo. On 05/14/2014 04:34 AM, Andras Nemeth wrote: E.g. if I accidentally use a closure which has something non-serializable in it, then my test will happily succeed in local mode but go down in flames on a real cluster.
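For what it's worth, a check like the following fails even in local mode in recent Spark versions, because Spark serializes the closure (which captures helper) before running it (a sketch; the class and variable names are invented):

  import org.apache.spark.SparkContext

  class NotSerializableHelper { def offset: Int = 42 }

  val sc = new SparkContext("local", "serialization check")
  val helper = new NotSerializableHelper
  // Throws a "task not serializable" error even with a local master.
  sc.parallelize(1 to 10).map(_ + helper.offset).collect()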
Re: Opinions stratosphere
Great reference! I just skimmed through the results without reading much of the methodology - but it looks like Spark outperforms Stratosphere fairly consistently in the experiments. It's too bad the data sources only range from 2GB to 8GB. Who knows if the apparent pattern would extend out to 64GB, 128GB, 1TB, and so on... On 05/01/2014 06:02 PM, Christopher Nguyen wrote: Someone (Ze Ni, https://www.sics.se/people/ze-ni) has actually attempted such a comparative study as a Masters thesis: http://www.diva-portal.org/smash/get/diva2:605106/FULLTEXT01.pdf According to this snapshot (c. 2013), Stratosphere is different from Spark in not having an explicit concept of an in-memory dataset (e.g., RDD). In principle this could be argued to be an implementation detail; the operators and execution plan/data flow are of primary concern in the API, and the data representation/materializations are otherwise unspecified. But in practice, for long-running interactive applications, I consider RDDs to be of fundamental, first-class citizen importance, and the key distinguishing feature of Spark's model vs other in-memory approaches that treat memory merely as an implicit cache. -- Christopher T. Nguyen Co-founder CEO, Adatao http://adatao.com linkedin.com/in/ctnguyen On Tue, Nov 26, 2013 at 1:26 PM, Matei Zaharia matei.zaha...@gmail.com wrote: I don't know a lot about it except from the research side, where the team has done interesting optimization stuff for these types of applications. In terms of the engine, one thing I'm not sure of is whether Stratosphere allows explicit caching of datasets (similar to RDD.cache()) and interactive queries (similar to spark-shell). But it's definitely an interesting project to watch. Matei On Nov 22, 2013, at 4:17 PM, Ankur Chauhan achau...@brightcove.com wrote: Hi, That's what I thought, but as per the slides on http://www.stratosphere.eu they seem to know about spark and the scala api does look similar. I found the PACT model interesting. Would like to know if matei or other core committers have something to weigh in on. -- Ankur On 22 Nov 2013, at 16:05, Patrick Wendell pwend...@gmail.com wrote: I've never seen that project before, would be interesting to get a comparison. Seems to offer a much lower level API. For instance this is a wordcount program: https://github.com/stratosphere/stratosphere/blob/master/pact/pact-examples/src/main/java/eu/stratosphere/pact/example/wordcount/WordCount.java On Thu, Nov 21, 2013 at 3:15 PM, Ankur Chauhan achau...@brightcove.com wrote: Hi, I was just curious about https://github.com/stratosphere/stratosphere and how spark compares to it. Anyone have any experience with it to make any comments? -- Ankur
RDD.tail()
Has there been any thought to adding a tail() method to RDD? It would be really handy to skip over the first item in an RDD when it contains header information. Even better would be a drop(int) function that would allow you to skip over several lines of header information. Our attempts to do something equivalent with a filter() call seem a bit contorted. Any thoughts? Thanks, Philip
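In the meantime, drop(n) can be approximated with mapPartitionsWithIndex, which is less contorted than a filter (a sketch; it assumes the header lines all land in the first partition, which holds for a freshly loaded text file):

  import org.apache.spark.rdd.RDD

  // Skip the first n lines of the RDD.
  def dropHeader(rdd: RDD[String], n: Int): RDD[String] =
    rdd.mapPartitionsWithIndex { (partition, lines) =>
      if (partition == 0) lines.drop(n) else lines
    }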
Re: Is there a way to get the current progress of the job?
This is great news; thanks for the update! I will either wait for the 1.0 release or go and test it ahead of time from git, rather than trying to pull it out of JobLogger or creating my own SparkListener. On 04/02/2014 06:48 PM, Andrew Or wrote: Hi Philip, In the upcoming release of Spark 1.0 there will be a feature that provides for exactly what you describe: capturing the information displayed on the UI in JSON. More details will be provided in the documentation, but for now, anything before 0.9.1 can only go through JobLogger.scala, which outputs information in a somewhat arbitrary format and will be deprecated soon. If you find this feature useful, you can test it out by building the master branch of Spark yourself, following the instructions in https://github.com/apache/spark/pull/42. Andrew On Wed, Apr 2, 2014 at 3:39 PM, Philip Ogren philip.og...@oracle.com wrote: What I'd like is a way to capture the information provided on the stages page (i.e. cluster:4040/stages via IndexPage). Looking through the Spark code, it doesn't seem like it is possible to directly query for specific facts such as how many tasks have succeeded or how many total tasks there are for a given active stage. Instead, it looks like all the data for the page is generated at once using information from the JobProgressListener. It doesn't seem like I have any way to programmatically access this information myself. I can't even instantiate my own JobProgressListener because it is spark package private. I could implement my own SparkListener and gather up the information myself. It feels a bit awkward since classes like Task and TaskInfo are also spark package private. It does seem possible to gather up what I need, but it seems like this sort of information should just be available without implementing a custom SparkListener (or worse, screen scraping the html generated by StageTable!) I was hoping that I would find the answer in MetricsServlet, which is turned on by default. It seems that when I visit http://cluster:4040/metrics/json/ I should be able to get everything I want, but I don't see the basic stage/task progress information I would expect. Are there special metrics properties that I should set to get this info? I think this would be the best solution - just give it the right URL and parse the resulting JSON - but I can't seem to figure out how to do this or if it is possible. Any advice is appreciated. Thanks, Philip On 04/01/2014 09:43 AM, Philip Ogren wrote: Hi DB, Just wondering if you ever got an answer to your question about monitoring progress - either offline or through your own investigation. Any findings would be appreciated. Thanks, Philip On 01/30/2014 10:32 PM, DB Tsai wrote: Hi guys, When we're running a very long job, we would like to show users the current progress of the map and reduce job. After looking at the api document, I don't find anything for this. However, in Spark UI, I could see the progress of the task. Is there anything I missed? Thanks. Sincerely, DB Tsai Machine Learning Engineer Alpine Data Labs -- Web: http://alpinenow.com/
Re: Is there a way to get the current progress of the job?
I can appreciate the reluctance to expose something like the JobProgressListener as a public interface. It's exactly the sort of thing that you want to deprecate as soon as something better comes along, and it can be a real pain when trying to maintain the level of backwards compatibility that we all expect from commercial grade software. Instead of simply marking it private and therefore unavailable to Spark developers, it might be worth incorporating something like a @Beta annotation http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/annotations/Beta.html which you could sprinkle liberally throughout Spark, and which communicates "hey, use this if you want to, because it's here now, but don't come crying if we rip it out or change it later." This might be better than simply marking so many useful functions/classes as private. I bet such an annotation could generate a compile warning/error for those who don't want to risk using them. On 04/02/2014 06:40 PM, Patrick Wendell wrote: Hey Philip, Right now there is no mechanism for this. You have to go in through the low level listener interface. We could consider exposing the JobProgressListener directly - I think it's been factored nicely so it's fairly decoupled from the UI. The concern is this is a semi-internal piece of functionality and something we might, e.g., want to change the API of over time. - Patrick On Wed, Apr 2, 2014 at 3:39 PM, Philip Ogren philip.og...@oracle.com wrote: What I'd like is a way to capture the information provided on the stages page (i.e. cluster:4040/stages via IndexPage). Looking through the Spark code, it doesn't seem like it is possible to directly query for specific facts such as how many tasks have succeeded or how many total tasks there are for a given active stage. Instead, it looks like all the data for the page is generated at once using information from the JobProgressListener. It doesn't seem like I have any way to programmatically access this information myself. I can't even instantiate my own JobProgressListener because it is spark package private. I could implement my own SparkListener and gather up the information myself. It feels a bit awkward since classes like Task and TaskInfo are also spark package private. It does seem possible to gather up what I need, but it seems like this sort of information should just be available without implementing a custom SparkListener (or worse, screen scraping the html generated by StageTable!) I was hoping that I would find the answer in MetricsServlet, which is turned on by default. It seems that when I visit http://cluster:4040/metrics/json/ I should be able to get everything I want, but I don't see the basic stage/task progress information I would expect. Are there special metrics properties that I should set to get this info? I think this would be the best solution - just give it the right URL and parse the resulting JSON - but I can't seem to figure out how to do this or if it is possible. Any advice is appreciated. Thanks, Philip On 04/01/2014 09:43 AM, Philip Ogren wrote: Hi DB, Just wondering if you ever got an answer to your question about monitoring progress - either offline or through your own investigation. Any findings would be appreciated. Thanks, Philip On 01/30/2014 10:32 PM, DB Tsai wrote: Hi guys, When we're running a very long job, we would like to show users the current progress of the map and reduce job. After looking at the api document, I don't find anything for this.
However, in Spark UI, I could see the progress of the task. Is there anything I missed? Thanks. Sincerely, DB Tsai Machine Learning Engineer Alpine Data Labs -- Web: http://alpinenow.com/
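As for the @Beta idea: in Scala the annotation itself is trivial to define; making the compiler actually warn on use would take a compiler plugin, so this sketch only documents intent (the names are invented):

  import scala.annotation.StaticAnnotation

  /** Marks an API that may change or disappear in any release. */
  class Beta extends StaticAnnotation

  @Beta
  class SomeExperimentalListener  // hypothetical class exposed under the annotation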
Re: Is there a way to get the current progress of the job?
What I'd like is a way to capture the information provided on the stages page (i.e. cluster:4040/stages via IndexPage). Looking through the Spark code, it doesn't seem like it is possible to directly query for specific facts such as how many tasks have succeeded or how many total tasks there are for a given active stage. Instead, it looks like all the data for the page is generated at once using information from the JobProgressListener. It doesn't seem like I have any way to programmatically access this information myself. I can't even instantiate my own JobProgressListener because it is spark package private. I could implement my own SparkListener and gather up the information myself. It feels a bit awkward since classes like Task and TaskInfo are also spark package private. It does seem possible to gather up what I need, but it seems like this sort of information should just be available without implementing a custom SparkListener (or worse, screen scraping the html generated by StageTable!) I was hoping that I would find the answer in MetricsServlet, which is turned on by default. It seems that when I visit http://cluster:4040/metrics/json/ I should be able to get everything I want, but I don't see the basic stage/task progress information I would expect. Are there special metrics properties that I should set to get this info? I think this would be the best solution - just give it the right URL and parse the resulting JSON - but I can't seem to figure out how to do this or if it is possible. Any advice is appreciated. Thanks, Philip On 04/01/2014 09:43 AM, Philip Ogren wrote: Hi DB, Just wondering if you ever got an answer to your question about monitoring progress - either offline or through your own investigation. Any findings would be appreciated. Thanks, Philip On 01/30/2014 10:32 PM, DB Tsai wrote: Hi guys, When we're running a very long job, we would like to show users the current progress of the map and reduce job. After looking at the api document, I don't find anything for this. However, in Spark UI, I could see the progress of the task. Is there anything I missed? Thanks. Sincerely, DB Tsai Machine Learning Engineer Alpine Data Labs -- Web: http://alpinenow.com/
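A rough sketch of the custom SparkListener route (written against a later, public listener API; the field names shifted between the versions discussed here, and the counters are atomic because callbacks arrive on a listener thread):

  import java.util.concurrent.atomic.AtomicInteger
  import org.apache.spark.Success
  import org.apache.spark.scheduler.{SparkListener, SparkListenerStageSubmitted, SparkListenerTaskEnd}

  class ProgressListener extends SparkListener {
    val totalTasks = new AtomicInteger(0)
    val succeededTasks = new AtomicInteger(0)

    override def onStageSubmitted(stageSubmitted: SparkListenerStageSubmitted): Unit =
      totalTasks.addAndGet(stageSubmitted.stageInfo.numTasks)

    override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit =
      if (taskEnd.reason == Success) succeededTasks.incrementAndGet()
  }

The listener would be registered with sc.addSparkListener(new ProgressListener).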
RDD[URI]
In my Spark programming thus far my unit of work has been a single row from an HDFS file, by creating an RDD[Array[String]] with something like: spark.textFile(path).map(_.split("\t")) Now, I'd like to do some work over a large collection of files in which the unit of work is a single file (rather than a row from a file.) Does Spark anticipate users creating an RDD[URI] or RDD[File] or some such and supporting actions and transformations that one might want to do on such an RDD? Any advice and/or code snippets would be appreciated! Thanks, Philip
Re: RDD[URI]
Thank you for the links! These look very useful. I do not have a precise use case - at this point I'm just exploring what is possible/feasible. Like the blog suggests, I might have a bunch of images lying around and might want to collect meta-data from them. In my case, I do a lot of NLP and so I would like to process text from a large collection of documents, perhaps after running them through Tika. Both of these use cases seem closely related from a Spark user's perspective. On 1/30/2014 11:02 AM, Nick Pentreath wrote: What is the precise use case and reasoning behind wanting to work on a File as the record in an RDD? CombineFileInputFormat may be useful in some way: http://www.idryman.org/blog/2013/09/22/process-small-files-on-hadoop-using-combinefileinputformat-1/ https://svn.apache.org/repos/asf/hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-examples/src/main/java/org/apache/hadoop/examples/MultiFileWordCount.java -- Sent from Mailbox https://www.dropbox.com/mailbox for iPhone On Thu, Jan 30, 2014 at 7:34 PM, Christopher Nguyen c...@adatao.com wrote: Philip, I guess the key problem statement is the large collection of files part? If so this may be helpful, at the HDFS level: http://blog.cloudera.com/blog/2009/02/the-small-files-problem/. Otherwise you can always start with an RDD[fileUri] and go from there to an RDD[(fileUri, read_contents)]. Sent while mobile. Pls excuse typos etc. On Jan 30, 2014 9:13 AM, 尹绪森 yinxu...@gmail.com wrote: I am also interested in this. My solution now is turning each file into a single line of text, i.e. deleting all '\n', then adding the filename as the head of the line followed by a space: [filename] [space] [content] Anyone have better ideas? On 2014-1-31 12:18 AM, Philip Ogren philip.og...@oracle.com wrote: In my Spark programming thus far my unit of work has been a single row from an HDFS file, by creating an RDD[Array[String]] with something like: spark.textFile(path).map(_.split("\t")) Now, I'd like to do some work over a large collection of files in which the unit of work is a single file (rather than a row from a file.) Does Spark anticipate users creating an RDD[URI] or RDD[File] or some such and supporting actions and transformations that one might want to do on such an RDD? Any advice and/or code snippets would be appreciated! Thanks, Philip
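Christopher's RDD[(fileUri, read_contents)] suggestion, spelled out as a sketch (listMyFiles is a hypothetical enumeration step, and the URIs must be resolvable from the workers):

  import scala.io.Source

  val fileUris: Seq[String] = listMyFiles()  // e.g. http:// or file:// URIs on a shared mount
  val uriContents = sc.parallelize(fileUris).map { uri =>
    val source = Source.fromURL(uri)
    try (uri, source.mkString) finally source.close()
  }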
various questions about yarn-standalone vs. yarn-client
I have a few questions about yarn-standalone and yarn-client deployment modes that are described on the Launching Spark on YARN http://spark.incubator.apache.org/docs/latest/running-on-yarn.html page.

1) Can someone give me a basic conceptual overview? I am struggling with understanding the difference between the yarn-standalone and yarn-client deployment modes. I understand that yarn-standalone runs on the name node and that yarn-client can be run from a remote machine - but otherwise I don't understand how they are different. It seems like yarn-client is the obvious better approach because it can run from anywhere - but presumably there is some advantage to yarn-standalone (otherwise, why not just run yarn-client on the name node or from a remote machine?) I'm also curious to know what standalone refers to here.

2) I was able to run SparkPi in yarn-client mode from a simple scala main method by providing only the SPARK_JAR and SPARK_YARN_APP_JAR environment variables and by putting the various *-site.xml files on my classpath. That is, I didn't call run-example - I just called my Scala app directly. We've had trouble duplicating this success on our own app and are in the process of applying the patch detailed here: https://github.com/apache/incubator-spark/pull/371 However, one thing that I think I learned is that Spark doesn't have to be installed on the name node. Is that correct? Should I need to have Spark installed at all, either on my remote machine or on the name node? It would be great if all that was needed were the SPARK_JAR and the SPARK_YARN_APP_JAR.

3) Finally, is it possible to pre-stage the assembly jar files so they don't need to be copied over every time I start a new Spark job in yarn-client mode?

Any advice here is appreciated. Thanks! Philip
Re: Anyone know how to submit a spark job to yarn in java code?
Great question! I was writing up a similar question this morning and decided to investigate some more before sending. Here's what I'm trying. I have created a new scala project that contains only spark-examples-assembly-0.8.1-incubating.jar and spark-assembly-0.8.1-incubating-hadoop2.2.0-cdh5.0.0-beta-1.jar on the classpath, and I am trying to create a yarn-client SparkContext with the following:

  val spark = new SparkContext("yarn-client", "my-app")

My hope is to run this on my laptop and have it execute/connect on the yarn application master. The hope is that if I can get this to work, then I can do the same from a web application. I'm trying to unpack run-example.sh, compute-classpath, SparkPi, and *.yarn.Client to figure out what environment variables I need to set up, etc. I grabbed all the .xml files out of my cluster's conf directory (in my case /etc/hadoop/conf.cloudera.yarn), such as e.g. yarn-site.xml, and put them on my classpath. I also set up the environment variables SPARK_JAR, SPARK_YARN_APP_JAR, SPARK_YARN_USER_ENV, and SPARK_HOME. When I run my simple scala script, I get the following error:

Exception in thread main org.apache.spark.SparkException: Yarn application already ended, might be killed or not able to launch application master.
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApp(YarnClientSchedulerBackend.scala:95)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:72)
at org.apache.spark.scheduler.cluster.ClusterScheduler.start(ClusterScheduler.scala:119)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:273)
at SparkYarnClientExperiment$.main(SparkYarnClientExperiment.scala:14)
at SparkYarnClientExperiment.main(SparkYarnClientExperiment.scala)

I can look at my yarn UI and see that it registers a failed application, so I take this as incremental progress. However, I'm not sure how to troubleshoot what I'm doing from here, or if what I'm trying to do is even sensible/possible. Any advice is appreciated. Thanks, Philip

On 1/15/2014 11:25 AM, John Zhao wrote: Now I am working on a web application and I want to submit a spark job to hadoop yarn. I have already done my own assembly and can run it on the command line with the following script:

export YARN_CONF_DIR=/home/gpadmin/clusterConfDir/yarn
export SPARK_JAR=./assembly/target/scala-2.9.3/spark-assembly-0.8.1-incubating-hadoop2.0.5-alpha.jar
./spark-class org.apache.spark.deploy.yarn.Client --jar ./examples/target/scala-2.9.3/spark-examples-assembly-0.8.1-incubating.jar --class org.apache.spark.examples.SparkPi --args yarn-standalone --num-workers 3 --master-memory 1g --worker-memory 512m --worker-cores 1

It works fine. Then I realized that it is hard to submit the job from a web application. It looks like the spark-assembly-0.8.1-incubating-hadoop2.0.5-alpha.jar or spark-examples-assembly-0.8.1-incubating.jar is a really big jar. I believe it contains everything. So my questions are: 1) When I run the above script, which jar is being submitted to the yarn server? 2) It looks like the spark-assembly-0.8.1-incubating-hadoop2.0.5-alpha.jar plays the role of client side and spark-examples-assembly-0.8.1-incubating.jar goes with the spark runtime and examples which will be running in yarn, am I right? 3) Does anyone have any similar experience? I did lots of hadoop MR stuff and want to follow the same logic to submit a spark job. For now I can only find the command line way to submit a spark job to yarn. I believe there is an easy way to integrate spark in a web application.
Thanks. John.
Re: Anyone know how to submit a spark job to yarn in java code?
My problem seems to be related to this: https://issues.apache.org/jira/browse/MAPREDUCE-4052 So, I will try running my setup from a Linux client and see if I have better luck. On 1/15/2014 11:38 AM, Philip Ogren wrote: Great question! I was writing up a similar question this morning and decided to investigate some more before sending. Here's what I'm trying. I have created a new scala project that contains only spark-examples-assembly-0.8.1-incubating.jar and spark-assembly-0.8.1-incubating-hadoop2.2.0-cdh5.0.0-beta-1.jar on the classpath, and I am trying to create a yarn-client SparkContext with the following:

  val spark = new SparkContext("yarn-client", "my-app")

My hope is to run this on my laptop and have it execute/connect on the yarn application master. The hope is that if I can get this to work, then I can do the same from a web application. I'm trying to unpack run-example.sh, compute-classpath, SparkPi, and *.yarn.Client to figure out what environment variables I need to set up, etc. I grabbed all the .xml files out of my cluster's conf directory (in my case /etc/hadoop/conf.cloudera.yarn), such as e.g. yarn-site.xml, and put them on my classpath. I also set up the environment variables SPARK_JAR, SPARK_YARN_APP_JAR, SPARK_YARN_USER_ENV, and SPARK_HOME. When I run my simple scala script, I get the following error:

Exception in thread main org.apache.spark.SparkException: Yarn application already ended, might be killed or not able to launch application master.
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApp(YarnClientSchedulerBackend.scala:95)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:72)
at org.apache.spark.scheduler.cluster.ClusterScheduler.start(ClusterScheduler.scala:119)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:273)
at SparkYarnClientExperiment$.main(SparkYarnClientExperiment.scala:14)
at SparkYarnClientExperiment.main(SparkYarnClientExperiment.scala)

I can look at my yarn UI and see that it registers a failed application, so I take this as incremental progress. However, I'm not sure how to troubleshoot what I'm doing from here, or if what I'm trying to do is even sensible/possible. Any advice is appreciated. Thanks, Philip

On 1/15/2014 11:25 AM, John Zhao wrote: Now I am working on a web application and I want to submit a spark job to hadoop yarn. I have already done my own assembly and can run it on the command line with the following script:

export YARN_CONF_DIR=/home/gpadmin/clusterConfDir/yarn
export SPARK_JAR=./assembly/target/scala-2.9.3/spark-assembly-0.8.1-incubating-hadoop2.0.5-alpha.jar
./spark-class org.apache.spark.deploy.yarn.Client --jar ./examples/target/scala-2.9.3/spark-examples-assembly-0.8.1-incubating.jar --class org.apache.spark.examples.SparkPi --args yarn-standalone --num-workers 3 --master-memory 1g --worker-memory 512m --worker-cores 1

It works fine. Then I realized that it is hard to submit the job from a web application. It looks like the spark-assembly-0.8.1-incubating-hadoop2.0.5-alpha.jar or spark-examples-assembly-0.8.1-incubating.jar is a really big jar. I believe it contains everything. So my questions are: 1) When I run the above script, which jar is being submitted to the yarn server? 2) It looks like the spark-assembly-0.8.1-incubating-hadoop2.0.5-alpha.jar plays the role of client side and spark-examples-assembly-0.8.1-incubating.jar goes with the spark runtime and examples which will be running in yarn, am I right? 3) Does anyone have any similar experience?
I did lots of hadoop MR stuff and want to follow the same logic to submit a spark job. For now I can only find the command line way to submit a spark job to yarn. I believe there is an easy way to integrate spark in a web application. Thanks. John.
rdd.saveAsTextFile problem
I have a very simple Spark application that looks like the following:

  var myRdd: RDD[Array[String]] = initMyRdd()
  println(myRdd.first.mkString(", "))
  println(myRdd.count)
  myRdd.saveAsTextFile("hdfs://myserver:8020/mydir")
  myRdd.saveAsTextFile("target/mydir/")

The println statements work as expected. The first saveAsTextFile statement also works as expected. The second saveAsTextFile statement does not (even if the first is commented out.) I get the exception pasted below. If I inspect target/mydir I see that there is a directory called _temporary/0/_temporary/attempt_201401020953__m_00_1 which contains an empty part-0 file. It's curious because this code worked before with Spark 0.8.0 and now I am running on Spark 0.8.1. I happen to be running this on Windows in local mode at the moment. Perhaps I should try running it on my linux box. Thanks, Philip

Exception in thread main org.apache.spark.SparkException: Job aborted: Task 2.0:0 failed more than 0 times; aborting job java.lang.NullPointerException
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:827)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:825)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:60)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:825)
at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:440)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$run(DAGScheduler.scala:502)
at org.apache.spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:157)
Re: rdd.saveAsTextFile problem
Not really. In practice I write everything out to HDFS and that is working fine. But I write lots of unit tests and example scripts, and it is convenient to be able to test a Spark application (or sequence of spark functions) in a very local way such that it doesn't depend on any outside infrastructure (e.g. an HDFS server.) So, it is convenient to write out a small amount of data locally and manually inspect the results - esp. as I'm building up a unit or regression test. So, ultimately writing results out to a local file isn't that important to me. However, I was just trying to run a simple example script that worked before and is now not working. Thanks, Philip On 1/2/2014 10:28 AM, Andrew Ash wrote: You want to write it to a local file on the machine? Try using file:///path/to/target/mydir/ instead. I'm not sure what the behavior would be if you did this on a multi-machine cluster though -- you may get a bit of data on each machine in that local directory. On Thu, Jan 2, 2014 at 12:22 PM, Philip Ogren philip.og...@oracle.com wrote: I have a very simple Spark application that looks like the following:

  var myRdd: RDD[Array[String]] = initMyRdd()
  println(myRdd.first.mkString(", "))
  println(myRdd.count)
  myRdd.saveAsTextFile("hdfs://myserver:8020/mydir")
  myRdd.saveAsTextFile("target/mydir/")

The println statements work as expected. The first saveAsTextFile statement also works as expected. The second saveAsTextFile statement does not (even if the first is commented out.) I get the exception pasted below. If I inspect target/mydir I see that there is a directory called _temporary/0/_temporary/attempt_201401020953__m_00_1 which contains an empty part-0 file. It's curious because this code worked before with Spark 0.8.0 and now I am running on Spark 0.8.1. I happen to be running this on Windows in local mode at the moment. Perhaps I should try running it on my linux box. Thanks, Philip

Exception in thread main org.apache.spark.SparkException: Job aborted: Task 2.0:0 failed more than 0 times; aborting job java.lang.NullPointerException
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:827)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:825)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:60)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:825)
at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:440)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$run(DAGScheduler.scala:502)
at org.apache.spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:157)
Re: rdd.saveAsTextFile problem
I just tried your suggestion and get the same results with the _temporary directory. Thanks though. On 1/2/2014 10:28 AM, Andrew Ash wrote: You want to write it to a local file on the machine? Try using file:///path/to/target/mydir/ instead. I'm not sure what the behavior would be if you did this on a multi-machine cluster though -- you may get a bit of data on each machine in that local directory. On Thu, Jan 2, 2014 at 12:22 PM, Philip Ogren philip.og...@oracle.com wrote: I have a very simple Spark application that looks like the following:

  var myRdd: RDD[Array[String]] = initMyRdd()
  println(myRdd.first.mkString(", "))
  println(myRdd.count)
  myRdd.saveAsTextFile("hdfs://myserver:8020/mydir")
  myRdd.saveAsTextFile("target/mydir/")

The println statements work as expected. The first saveAsTextFile statement also works as expected. The second saveAsTextFile statement does not (even if the first is commented out.) I get the exception pasted below. If I inspect target/mydir I see that there is a directory called _temporary/0/_temporary/attempt_201401020953__m_00_1 which contains an empty part-0 file. It's curious because this code worked before with Spark 0.8.0 and now I am running on Spark 0.8.1. I happen to be running this on Windows in local mode at the moment. Perhaps I should try running it on my linux box. Thanks, Philip

Exception in thread main org.apache.spark.SparkException: Job aborted: Task 2.0:0 failed more than 0 times; aborting job java.lang.NullPointerException
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:827)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:825)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:60)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:825)
at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:440)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$run(DAGScheduler.scala:502)
at org.apache.spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:157)
Re: rdd.saveAsTextFile problem
Yep - that works great and is what I normally do. I perhaps should have framed my email as a bug report. The documentation for saveAsTextFile says you can write results out to a local file, but it doesn't work for me per the described behavior. It also worked before and now it doesn't. So, it seems like a bug. Should I file a Jira issue? I haven't done that yet for this project but would be happy to. Thanks, Philip On 1/2/2014 11:23 AM, Andrew Ash wrote: For testing, maybe try using .collect and doing the comparison between expected and actual in memory rather than on disk? On Thu, Jan 2, 2014 at 12:54 PM, Philip Ogren philip.og...@oracle.com wrote: I just tried your suggestion and get the same results with the _temporary directory. Thanks though. On 1/2/2014 10:28 AM, Andrew Ash wrote: You want to write it to a local file on the machine? Try using file:///path/to/target/mydir/ instead. I'm not sure what the behavior would be if you did this on a multi-machine cluster though -- you may get a bit of data on each machine in that local directory. On Thu, Jan 2, 2014 at 12:22 PM, Philip Ogren philip.og...@oracle.com wrote: I have a very simple Spark application that looks like the following:

  var myRdd: RDD[Array[String]] = initMyRdd()
  println(myRdd.first.mkString(", "))
  println(myRdd.count)
  myRdd.saveAsTextFile("hdfs://myserver:8020/mydir")
  myRdd.saveAsTextFile("target/mydir/")

The println statements work as expected. The first saveAsTextFile statement also works as expected. The second saveAsTextFile statement does not (even if the first is commented out.) I get the exception pasted below. If I inspect target/mydir I see that there is a directory called _temporary/0/_temporary/attempt_201401020953__m_00_1 which contains an empty part-0 file. It's curious because this code worked before with Spark 0.8.0 and now I am running on Spark 0.8.1. I happen to be running this on Windows in local mode at the moment. Perhaps I should try running it on my linux box. Thanks, Philip

Exception in thread main org.apache.spark.SparkException: Job aborted: Task 2.0:0 failed more than 0 times; aborting job java.lang.NullPointerException
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:827)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:825)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:60)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:825)
at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:440)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$run(DAGScheduler.scala:502)
at org.apache.spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:157)
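Andrew's .collect suggestion as a sketch, using myRdd from the original message (the expected values here are made up):

  // Compare in memory; no filesystem, no _temporary directories.
  val expected = Seq("a,b", "c,d")
  val actual = myRdd.map(_.mkString(",")).collect().toSeq
  assert(actual == expected, "unexpected results: " + actual)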
Re: multi-line elements
Thank you for pointing me in the right direction! On 12/24/2013 2:39 PM, suman bharadwaj wrote: Just one correction, I think NLineInputFormat won't fit your usecase. I think you may have to write a custom record reader, use textinputformat, and plug it into spark as shown above. Regards, Suman Bharadwaj S On Wed, Dec 25, 2013 at 2:51 AM, suman bharadwaj suman@gmail.com wrote: Even I'm new to spark. But I was able to write a custom sentence input format and sentence record reader which reads multiple lines of text with the record boundary being [.?!]\s* using Hadoop APIs. And plugged the SentenceInputFormat into the spark api as shown below:

  val inputRead = sc.hadoopFile("path to the file in hdfs", classOf[SentenceTextInputFormat],
      classOf[LongWritable], classOf[Text]).map(value => value._2.toString)

In your case, you can use the NLineInputFormat i guess, which is provided by hadoop, and pass it as a parameter. Maybe there are better ways to do it. Regards, Suman Bharadwaj S On Wed, Dec 25, 2013 at 1:57 AM, Philip Ogren philip.og...@oracle.com wrote: I have a file that consists of multi-line records. Is it possible to read in multi-line records with a method such as SparkContext.newAPIHadoopFile? Or do I need to pre-process the data so that all the data for one element is in a single line? Thanks, Philip
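If the records are separated by a fixed delimiter rather than sentence boundaries, a custom record reader can sometimes be avoided entirely (a sketch; it assumes blank-line-separated records and a Hadoop version whose LineRecordReader honors textinputformat.record.delimiter):

  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.io.{LongWritable, Text}
  import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

  val conf = new Configuration(sc.hadoopConfiguration)
  conf.set("textinputformat.record.delimiter", "\n\n")  // records separated by blank lines
  val records = sc.newAPIHadoopFile("hdfs://path/to/file", classOf[TextInputFormat],
      classOf[LongWritable], classOf[Text], conf)
    .map(_._2.toString)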
Re: writing to HDFS with a given username
Well, it only uses my user name when I run my application in local mode (i.e. spark is running on my laptop with a master url of "local".) Not a general solution for you, I'm afraid! On 12/12/2013 5:38 PM, Koert Kuipers wrote: Hey Philip, how do you get spark to write to hdfs with your user name? When I use spark it writes to hdfs as the user that runs the spark services... I wish it read and wrote as me. On Thu, Dec 12, 2013 at 6:37 PM, Philip Ogren philip.og...@oracle.com wrote: When I call rdd.saveAsTextFile("hdfs://...") it uses my username to write to the HDFS drive. If I try to write to an HDFS directory that I do not have permissions to, then I get an error like this: Permission denied: user=me, access=WRITE, inode=/user/you/:you:us:drwxr-xr-x I can obviously get around this by changing the permissions on the directory /user/you. However, is it possible to call rdd.saveAsTextFile with an alternate username and password? Thanks, Philip
exposing spark through a web service
Hi Spark Community, I would like to expose my spark application/libraries via a web service in order to launch jobs, interact with users, etc. I'm sure there are hundreds of ways to think about doing this, each with a variety of technology stacks that could be applied. So, I know there is no right answer here. But I am curious to know if anyone has a fairly general solution that they are happy with and would recommend, or if there is any growing consensus in this community towards a solution or family of solutions that align with the Spark way of doing things. Any advice or thoughts are appreciated! Thanks, Philip
writing to HDFS with a given username
When I call rdd.saveAsTextFile("hdfs://...") it uses my username to write to the HDFS drive. If I try to write to an HDFS directory that I do not have permissions to, then I get an error like this: Permission denied: user=me, access=WRITE, inode=/user/you/:you:us:drwxr-xr-x I can obviously get around this by changing the permissions on the directory /user/you. However, is it possible to call rdd.saveAsTextFile with an alternate username and password? Thanks, Philip
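There is no username parameter on saveAsTextFile, but when HDFS runs without Kerberos it trusts whatever identity the client presents, so something like the following can work from the driver (a sketch only; on a real cluster the tasks still run as the executor's user, which matches the local-mode caveat in the reply earlier in this thread):

  import java.security.PrivilegedExceptionAction
  import org.apache.hadoop.security.UserGroupInformation

  // Impersonate "you" for the save (simple auth only, no Kerberos).
  val ugi = UserGroupInformation.createRemoteUser("you")
  ugi.doAs(new PrivilegedExceptionAction[Unit] {
    override def run(): Unit = rdd.saveAsTextFile("hdfs://myserver:8020/user/you/mydir")
  })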
Re: Fwd: Spark forum question
You might try a more standard windows path. I typically write to a local directory such as target/spark-output. On 12/11/2013 10:45 AM, Nathan Kronenfeld wrote: We are trying to test out running Spark 0.8.0 on a Windows box, and while we can get it to run all the examples that don't output results to disk, we can't get it to write output. Has anyone been able to write out to a local file on a single node windows install without using hdfs? Here is our test code:

  object FileWritingTest {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext("local[1]", "File Writing Test", null, null, null, null)
      // generate some work to do
      val res = sc.parallelize(Range(0, 10), 10).flatMap(p => "%d".format(p * 10))
      // save the results out to a file
      res.saveAsTextFile("file:///c:/somepath")
    }
  }

This works as expected using a unix based system. However, when trying to run on a windows cmd shell I get the following errors:

[WARN] 11 Dec 2013 12:00:33 - org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[INFO] 11 Dec 2013 12:00:33 - org.apache.spark.Logging$class - Saving as hadoop file of type (NullWritable, Text)
[INFO] 11 Dec 2013 12:00:33 - org.apache.spark.Logging$class - Starting job: saveAsTextFile at Test.scala:19
[INFO] 11 Dec 2013 12:00:33 - org.apache.spark.Logging$class - Got job 0 (saveAsTextFile at Test.scala:19) with 10 output partitions (allowLocal=false)
[INFO] 11 Dec 2013 12:00:33 - org.apache.spark.Logging$class - Final stage: Stage 0 (saveAsTextFile at Test.scala:19)
[INFO] 11 Dec 2013 12:00:33 - org.apache.spark.Logging$class - Parents of final stage: List()
[INFO] 11 Dec 2013 12:00:33 - org.apache.spark.Logging$class - Missing parents: List()
[INFO] 11 Dec 2013 12:00:33 - org.apache.spark.Logging$class - Submitting Stage 0 (MappedRDD[2] at saveAsTextFile at Test.scala:19), which has no missing parents
[INFO] 11 Dec 2013 12:00:33 - org.apache.spark.Logging$class - Submitting 10 missing tasks from Stage 0 (MappedRDD[2] at saveAsTextFile at Test.scala:19)
[INFO] 11 Dec 2013 12:00:33 - org.apache.spark.Logging$class - Size of task 0 is 5966 bytes
[INFO] 11 Dec 2013 12:00:33 - org.apache.spark.Logging$class - Running 0
[INFO] 11 Dec 2013 12:00:33 - org.apache.spark.Logging$class - Loss was due to org.apache.hadoop.util.Shell$ExitCodeException
org.apache.hadoop.util.Shell$ExitCodeException: chmod: getting attributes of `/cygdrive/c/somepath/_temporary/_attempt_201312111200__m_00_0/part-0': No such file or directory
at org.apache.hadoop.util.Shell.runCommand(Shell.java:261)
at org.apache.hadoop.util.Shell.run(Shell.java:188)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:381)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:467)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:450)
at org.apache.hadoop.fs.RawLocalFileSystem.execCommand(RawLocalFileSystem.java:593)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:584)
at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:427)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:465)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:433)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:886)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:781)
at org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:118)
at org.apache.hadoop.mapred.SparkHadoopWriter.open(SparkHadoopWriter.scala:86)
at org.apache.spark.rdd.PairRDDFunctions.writeToFile$1(PairRDDFunctions.scala:667)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$2.apply(PairRDDFunctions.scala:680)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$2.apply(PairRDDFunctions.scala:680)
at org.apache.spark.scheduler.ResultTask.run(ResultTask.scala:99)
at org.apache.spark.scheduler.local.LocalScheduler.runTask(LocalScheduler.scala:198)
at org.apache.spark.scheduler.local.LocalActor$$anonfun$launchTask$1$$anon$1.run(LocalScheduler.scala:68)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
[INFO] 11 Dec 2013 12:00:33 - org.apache.spark.Logging$class - Remove TaskSet 0.0 from pool
[INFO] 11 Dec 2013 12:00:33 - org.apache.spark.Logging$class - Failed to run saveAsTextFile at Test.scala:19
Exception in thread main
Re: Writing an RDD to Hive
Any chance you could sketch out the Shark APIs that you use for this? Matei's response suggests that the preferred API is coming in the next release (i.e. the RDDTable class in 0.8.1). Are you building Shark from the latest in the repo and using that? Or have you figured out other API calls that accomplish something similar? Thanks, Philip

On 12/8/2013 2:44 AM, Christopher Nguyen wrote: Philip, fwiw we do go with including Shark as a dependency for our needs, making a fat jar, and it works very well. It was quite a bit of pain what with the Hadoop/Hive transitive dependencies, but for us it was worth it. I hope that serves as an existence proof that says Mt Everest has been climbed, likely by more than just ourselves. Going forward this should be getting easier. -- Christopher T. Nguyen Co-founder CEO, Adatao http://adatao.com linkedin.com/in/ctnguyen
Writing an RDD to Hive
I have a simple scenario that I'm struggling to implement. I would like to take a fairly simple RDD generated from a large log file, perform some transformations on it, and write the results out such that I can perform a Hive query either from Hive (via Hue) or Shark. I'm having trouble with the last step. I am able to write my data out to HDFS and then execute a Hive create table statement followed by a load data statement as a separate step. I really dislike this separate manual step and would like to have it all accomplished in my Spark application. To this end, I have investigated two possible approaches as detailed below - it's probably too much information, so I'll ask my more basic question first: Does anyone have a basic recipe/approach for loading data in an RDD to a Hive table from a Spark application?

1) Load it into HBase via PairRDDFunctions.saveAsHadoopDataset. There is a nice detailed email on how to do this here: http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201311.mbox/%3ccacyzca3askwd-tujhqi1805bn7sctguaoruhd5xtxcsul1a...@mail.gmail.com%3E. I didn't get very far, though, because as soon as I added an hbase dependency (corresponding to the version of hbase we are running) to my pom.xml file, I had an slf4j dependency conflict that caused my current application to explode. I tried the latest released version and the slf4j dependency problem went away, but then the deprecated class TableOutputFormat no longer exists. Even if loading the data into hbase were trivially easy (and the detailed email suggests otherwise), I would then need to query HBase from Hive, which seems a little clunky.

2) So, I decided that Shark might be an easier option. All the examples provided in their documentation seem to assume that you are using Shark as an interactive application from a shell. Various threads I've seen seem to indicate that Shark isn't really intended to be used as a dependency in your Spark code (see this https://groups.google.com/forum/#%21topic/shark-users/DHhslaOGPLg/discussion and that https://groups.google.com/forum/#%21topic/shark-users/2_Ww1xlIgvo/discussion). It follows then that one can't add a Shark dependency to a pom.xml file because Shark isn't released via Maven Central (that I can tell; perhaps it's in some other repo?). Of course, there are ways of creating a local dependency in maven, but it starts to feel very hacky.

I realize that I've given sufficient detail to expose my ignorance in a myriad of ways. Please feel free to shine light on any of my misconceptions! Thanks, Philip
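For what it's worth, a minimal sketch of the two-step approach described above (write to HDFS, then issue the Hive statements over JDBC) driven entirely from the Spark application. The HiveServer2 host, the table schema, the output path, and transformedRdd (an RDD[List[String]]) are all hypothetical placeholders, and this assumes a Hive JDBC driver is on the classpath:

import java.sql.DriverManager

val outputPath = "hdfs:///user/philip/log-output" // hypothetical path
transformedRdd.map(_.mkString("\t")).saveAsTextFile(outputPath)

// then register/load the data in Hive over JDBC (hivehost:10000 is a placeholder)
Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection("jdbc:hive2://hivehost:10000/default", "user", "")
val stmt = conn.createStatement()
stmt.execute("CREATE TABLE IF NOT EXISTS log_table (col1 STRING, col2 STRING) " +
  "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'")
stmt.execute("LOAD DATA INPATH '" + outputPath + "' INTO TABLE log_table")
conn.close()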
Re: write data into HBase via spark
Hao, Thank you for the detailed response! (even if delayed!) I'm curious to know what version of hbase you added to your pom file. Thanks, Philip

On 11/14/2013 10:38 AM, Hao REN wrote: Hi, Philip. Basically, we need PairRDDFunctions.saveAsHadoopDataset to do the job; as HBase is not a filesystem, saveAsHadoopFile doesn't work.

def saveAsHadoopDataset(conf: JobConf): Unit

This function takes a JobConf parameter which should be configured. Essentially, you need to set the output format and the name of the output table.

// step 1: JobConf setup
// Note: the mapred package is used, instead of the mapreduce package which contains the new hadoop APIs.
import org.apache.hadoop.hbase.mapred.TableOutputFormat
import org.apache.hadoop.hbase.client._

// ... some other settings
val conf = HBaseConfiguration.create() // general hbase setting
conf.set("hbase.rootdir", "hdfs://" + nameNodeURL + ":" + hdfsPort + "/hbase")
conf.setBoolean("hbase.cluster.distributed", true)
conf.set("hbase.zookeeper.quorum", hostname)
conf.setInt("hbase.client.scanner.caching", 1)
// ... some other settings
val jobConfig: JobConf = new JobConf(conf, this.getClass)
// Note: TableOutputFormat is used as deprecated code, because JobConf is an old hadoop API
jobConfig.setOutputFormat(classOf[TableOutputFormat])
jobConfig.set(TableOutputFormat.OUTPUT_TABLE, outputTable)

// step 2: give your mapping
// The last thing to do is map your local data schema to the HBase one.
// Say our hbase schema is as below:
//   row | cf:col_1 | cf:col_2
// And in spark, you have an RDD of triples, like (1, 2, 3), (4, 5, 6), ...
// So you should map RDD[(Int, Int, Int)] to RDD[(ImmutableBytesWritable, Put)], where Put carries the mapping.
// You can define a function used by RDD.map, for example:
def convert(triple: (Int, Int, Int)) = {
  val p = new Put(Bytes.toBytes(triple._1))
  p.add(Bytes.toBytes("cf"), Bytes.toBytes("col_1"), Bytes.toBytes(triple._2))
  p.add(Bytes.toBytes("cf"), Bytes.toBytes("col_2"), Bytes.toBytes(triple._3))
  (new ImmutableBytesWritable, p)
}

// Suppose you have an RDD[(Int, Int, Int)] called localData; then writing data to hbase can be done by:
new PairRDDFunctions(localData.map(convert)).saveAsHadoopDataset(jobConfig)

Voilà. That's all you need. Hopefully, this simple example could help. Hao.

2013/11/13 Philip Ogren philip.og...@oracle.com: Hao, If you have worked out the code and turned it into an example that you can share, then please do! This task is in my queue of things to do, so any helpful details that you uncovered would be most appreciated. Thanks, Philip

On 11/13/2013 5:30 AM, Hao REN wrote: Ok, I worked it out. The following thread helps a lot. http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201310.mbox/%3C7B4868A9-B83E-4507-BB2A-2721FCE8E738%40gmail.com%3E Hao

2013/11/12 Hao REN julien19890...@gmail.com: Could someone show me a simple example of how to write data into HBase via Spark? I have checked the HbaseTest example; it's only for reading from HBase. Thank you. -- REN Hao Data Engineer @ ClaraVista Paris, France Tel: +33 06 14 54 57 24
Re: Writing to HBase
Here's a good place to start: http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201311.mbox/%3ccacyzca3askwd-tujhqi1805bn7sctguaoruhd5xtxcsul1a...@mail.gmail.com%3E

On 12/5/2013 10:18 AM, Benjamin Kim wrote: Does anyone have an example or some sort of starting-point code for writing from Spark Streaming into HBase? We currently stream ad server event log data using Flume-NG to tail log entries, collect them, and put them directly into an HBase table. We would like to do the same with Spark Streaming, but we would like to do some data massaging and simple data analysis beforehand. This will cut down the steps in prepping data and the number of tables for our data scientists and real-time feedback systems. Thanks, Ben
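To give the question above a concrete shape, here is an untested sketch of the foreachRDD pattern (called foreach on DStream in older releases), reusing the JobConf/Put recipe from the write-data-into-HBase thread below. The eventStream DStream, the events table, and the (rowKey, value) record shape are hypothetical:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapred.TableOutputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapred.JobConf
import org.apache.spark.SparkContext._ // for saveAsHadoopDataset on pair RDDs

val jobConfig = new JobConf(HBaseConfiguration.create())
jobConfig.setOutputFormat(classOf[TableOutputFormat])
jobConfig.set(TableOutputFormat.OUTPUT_TABLE, "events")

def toPut(event: (String, String)): (ImmutableBytesWritable, Put) = {
  val p = new Put(Bytes.toBytes(event._1)) // row key
  p.add(Bytes.toBytes("cf"), Bytes.toBytes("value"), Bytes.toBytes(event._2))
  (new ImmutableBytesWritable, p)
}

// write each micro-batch to HBase after any massaging/analysis
eventStream.foreachRDD(rdd => rdd.map(toPut).saveAsHadoopDataset(jobConfig))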
Re: write data into HBase via spark
Hao, If you have worked out the code and turned it into an example that you can share, then please do! This task is in my queue of things to do, so any helpful details that you uncovered would be most appreciated. Thanks, Philip

On 11/13/2013 5:30 AM, Hao REN wrote: Ok, I worked it out. The following thread helps a lot. http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201310.mbox/%3C7B4868A9-B83E-4507-BB2A-2721FCE8E738%40gmail.com%3E Hao

2013/11/12 Hao REN julien19890...@gmail.com: Could someone show me a simple example of how to write data into HBase via Spark? I have checked the HbaseTest example; it's only for reading from HBase. Thank you. -- REN Hao Data Engineer @ ClaraVista Paris, France Tel: +33 06 14 54 57 24
code review - splitting columns
Hi Spark community, I learned a lot the last time I posted some elementary Spark code here, so I thought I would do it again. Someone politely tell me offline if this is noise or an unfair use of the list! I acknowledge that this borders on asking Scala 101 questions. I have an RDD[List[String]] corresponding to columns of data, and I want to split one of the columns using some arbitrary function and return an RDD updated with the new columns. Here is the code I came up with:

def splitColumn(columnsRDD: RDD[List[String]], columnIndex: Int, numSplits: Int, splitFx: String => List[String]): RDD[List[String]] = {
  def insertColumns(columns: List[String]): List[String] = {
    val split = columns.splitAt(columnIndex)
    val left = split._1
    val splitColumn = split._2.head
    val splitColumns = splitFx(splitColumn).padTo(numSplits, "").take(numSplits)
    val right = split._2.tail
    left ++ splitColumns ++ right
  }
  columnsRDD.map(columns => insertColumns(columns))
}

Here is a simple test that demonstrates the behavior:

val spark = new SparkContext("local", "test spark")
val testStrings = List(List("1.2", "a b"), List("3.4", "c d e"), List("5.6", "f"))
var testRDD: RDD[List[String]] = spark.parallelize(testStrings)
testRDD = splitColumn(testRDD, 0, 2, _.split("\\.").toList)
testRDD = splitColumn(testRDD, 2, 2, _.split(" ").toList) // Line 5
val actualStrings = testRDD.collect.toList
assertEquals(4, actualStrings(0).length)
assertEquals("1, 2, a, b", actualStrings(0).mkString(", "))
assertEquals(4, actualStrings(1).length)
assertEquals("3, 4, c, d", actualStrings(1).mkString(", "))
assertEquals(4, actualStrings(2).length)
assertEquals("5, 6, f, ", actualStrings(2).mkString(", "))

My first concern about this code is that I'm missing out on something that does exactly this in the API. This seems like such a common use case that I would not be surprised if there's a readily available way to do this. I'm a little uncertain about the typing of splitColumn - i.e. the first parameter and the return value. It seems like a general solution wouldn't require every column to be a String value. I'm also annoyed that line 5 in the test code requires that I use an updated index to split what was originally the second column. This suggests that perhaps I should split all the columns that need splitting in one function call - but it seems like doing that would require an unwieldy function signature. Any advice or insight is appreciated! Thanks, Philip
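One possible way around the index-shifting annoyance at line 5 - offered as a sketch, not a claim that the API lacks something better - is to split several columns in one pass, keying the split functions by the original column indices. The splitColumns signature here is hypothetical:

def splitColumns(columnsRDD: RDD[List[String]], splits: Map[Int, (Int, String => List[String])]): RDD[List[String]] = {
  columnsRDD.map { columns =>
    columns.zipWithIndex.flatMap { case (value, i) =>
      splits.get(i) match {
        // pad/truncate exactly as splitColumn does above
        case Some((numSplits, splitFx)) => splitFx(value).padTo(numSplits, "").take(numSplits)
        case None => List(value)
      }
    }
  }
}

// both splits now refer to the original column indices 0 and 1
val splitRDD = splitColumns(testRDD, Map(
  0 -> (2, (s: String) => s.split("\\.").toList),
  1 -> (2, (s: String) => s.split(" ").toList)))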
code review - counting populated columns
Hi Spark coders, I wrote my first little Spark job that takes columnar data and counts up how many times each column is populated in an RDD. Here is the code I came up with:

// RDD of List[String] corresponding to tab-delimited values
val columns = spark.textFile("myfile.tsv").map(line => line.split("\t").toList)
// RDD of List[Int] corresponding to populated columns (1 for populated and 0 for not populated)
val populatedColumns = columns.map(row => row.map(column => if (column.length > 0) 1 else 0))
// List[Int] containing the sums of the 1's in each column
val counts = populatedColumns.reduce((row1, row2) => (row1, row2).zipped.map(_ + _))

Any thoughts about the fitness of this code snippet? I'm a little annoyed by creating an RDD full of 1's and 0's in the second line. The if statement feels awkward too. I was happy to find the zipped method for the reduce step. Any feedback you might have on how to improve this code is appreciated. I'm a newbie to both Scala and Spark. Thanks, Philip
Re: code review - counting populated columns
Where does 'emit' come from? I don't see it in the Scala or Spark apidocs (though I don't feel very deft at searching either!) Thanks, Philip

On 11/8/2013 2:23 PM, Patrick Wendell wrote: It would be a bit more straightforward to write it like this:

val columns = [same as before]
val counts = columns.flatMap(emit (col_id, 0 or 1) for each column).reduceByKey(_ + _)

Basically, look at each row and emit several records using flatMap. Each record has an ID for the column (maybe its index) and a flag for whether it's present. Then you reduce by key to get the per-column count. Then you can collect at the end. - Patrick
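For what it's worth, a concrete (untested) rendering of Patrick's pseudocode: 'emit' is not an API call, just shorthand for the records produced inside flatMap. Note that reduceByKey requires import org.apache.spark.SparkContext._ in a standalone program:

val counts = columns.flatMap { row =>
  // emit one (columnIndex, 0 or 1) record per column of the row
  row.zipWithIndex.map { case (col, i) => (i, if (col.length > 0) 1 else 0) }
}.reduceByKey(_ + _)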
Re: code review - counting populated columns
Thank you for the pointers. I'm not sure I was able to fully understand either of your suggestions, but here is what I came up with. I started with Tom's code, but I think I ended up borrowing from Patrick's suggestion too. Any thoughts about my updated solution are more than welcome! I added local variable types for clarity.

def countPopulatedColumns(tsv: RDD[String]): RDD[(Int, Int)] = {
  // split by tab and zip with index to give (column value, column index) pairs
  val sparse: RDD[(String, Int)] = tsv.flatMap(line => line.split("\t").zipWithIndex)
  // filter out all the zero-length values
  val dense: RDD[(String, Int)] = sparse.filter(valueIndex => valueIndex._1.length > 0)
  // map each column index to one and do the usual reduction
  dense.map(valueIndex => (valueIndex._2, 1)).reduceByKey(_ + _)
}

Of course, this can be condensed to a single line, but it doesn't seem as easy to read as the more verbose code above. Write-once code like the following is why I never liked Perl:

def cpc(tsv: RDD[String]): RDD[(Int, Int)] = {
  tsv.flatMap(_.split("\t").zipWithIndex).filter(ci => ci._1.length > 0).map(ci => (ci._2, 1)).reduceByKey(_ + _)
}

Thanks, Philip

On 11/8/2013 2:41 PM, Patrick Wendell wrote: Hey Tom, reduceByKey will reduce locally on all the nodes, so there won't be any data movement except to combine totals at the end. - Patrick

On Fri, Nov 8, 2013 at 1:35 PM, Tom Vacek minnesota...@gmail.com wrote: Your example requires each row to be exactly the same length, since zipped will truncate to the shorter of its two arguments. The second solution is elegant, but reduceByKey involves flying a bunch of data around to sort the keys. I suspect it would be a lot slower. But you could save yourself from adding up a bunch of zeros:

val sparseRows = spark.textFile("myfile.tsv").map(line => line.split("\t").zipWithIndex.filter(_._1.length > 0))
sparseRows.reduce(mergeAdd(_, _))

You'll have to write a mergeAdd function. This might not be any faster, but it does allow variable-length rows.
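Tom's mergeAdd is left as an exercise above; here is one possible shape for it, sketched with each row represented as a map from column index to count (a slight variation on his arrays of pairs):

val sparseRows = spark.textFile("myfile.tsv").map(line =>
  line.split("\t").zipWithIndex.filter(_._1.length > 0).map(ci => (ci._2, 1)).toMap)

// merge two per-row count maps, summing counts for shared column indices
def mergeAdd(a: Map[Int, Int], b: Map[Int, Int]): Map[Int, Int] =
  (a.keySet ++ b.keySet).map(k => (k, a.getOrElse(k, 0) + b.getOrElse(k, 0))).toMap

val counts: Map[Int, Int] = sparseRows.reduce(mergeAdd)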
Where is reduceByKey?
On the front page http://spark.incubator.apache.org/ of the Spark website there is the following simple word count implementation:

file = spark.textFile("hdfs://...")
file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)

The same code can be found in the Quick Start http://spark.incubator.apache.org/docs/latest/quick-start.html guide. When I follow the steps in my spark-shell (version 0.8.0) it works fine. The reduceByKey method is also shown in the list of transformations http://spark.incubator.apache.org/docs/latest/scala-programming-guide.html#transformations in the Spark Programming Guide. The bottom of this list directs the reader to the API docs for the class RDD (this link is broken, BTW). The API docs for RDD http://spark.incubator.apache.org/docs/latest/api/core/index.html#org.apache.spark.rdd.RDD do not list a reduceByKey method for RDD. Also, when I try to compile the above code in a Scala class definition I get the following compile error:

value reduceByKey is not a member of org.apache.spark.rdd.RDD[(java.lang.String, Int)]

I am compiling with maven using the following dependency definition:

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.9.3</artifactId>
  <version>0.8.0-incubating</version>
</dependency>

Can someone help me understand why this code works fine from the spark-shell but doesn't seem to exist in the API docs and won't compile? Thanks, Philip
Re: Where is reduceByKey?
Thanks - I think this would be a helpful note to add to the docs. I went and read a few things about Scala implicit conversions (I'm obviously new to the language) and it seems like a very powerful language feature; now that I know about them it will certainly be easy to identify when they are missing (i.e. the first thing to suspect when you see a not a member compilation message.) I'm still a bit mystified as to how you would go about finding the appropriate imports, except that I suppose you aren't very likely to use methods that you don't already know about! Unless you are copying code verbatim that doesn't have the necessary import statements.

On 11/7/2013 4:05 PM, Matei Zaharia wrote: Yeah, this is confusing, and unfortunately as far as I know it's API specific. Maybe we should add this to the documentation page for RDD. The reason for these conversions is to only allow some operations based on the underlying data type of the collection. For example, Scala collections support sum() as long as they contain numeric types. That's fine for the Scala collection library since its conversions are imported by default, but I guess it makes it confusing for third-party apps. Matei

On Nov 7, 2013, at 1:15 PM, Philip Ogren philip.og...@oracle.com wrote: I remember running into something very similar when trying to perform a foreach on a java.util.List, and I fixed it by adding the following import:

import scala.collection.JavaConversions._

And my foreach loop magically compiled - presumably due to another implicit conversion. Now this is the second time I've run into this problem and I didn't recognize it. I'm not sure that I would know what to do the next time I run into this. Do you have some advice on how I should have recognized a missing import that provides implicit conversions and how I would know what to import? This strikes me as code obfuscation. I guess this is more of a Scala question. Thanks, Philip

On 11/7/2013 2:01 PM, Josh Rosen wrote: The additional methods on RDDs of pairs are defined in a class called PairRDDFunctions (https://spark.incubator.apache.org/docs/latest/api/core/index.html#org.apache.spark.rdd.PairRDDFunctions). SparkContext provides an implicit conversion from RDD[T] to PairRDDFunctions[T] to make this transparent to users. To import those implicit conversions, use:

import org.apache.spark.SparkContext._

These conversions are automatically imported by Spark Shell, but you'll have to import them yourself in standalone programs.
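To make the resolution above concrete, the front-page example compiles in a standalone program once the implicit conversions are imported; a minimal sketch:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._ // brings pair-RDD methods such as reduceByKey into scope

val spark = new SparkContext("local", "word count")
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)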
compare/contrast Spark with Cascading
My team is investigating a number of technologies in the Big Data space. A team member recently got turned on to Cascading http://www.cascading.org/about-cascading/ as an application layer for orchestrating complex workflows/scenarios. He asked me whether Spark has an application layer. My initial reaction is no: Spark would not have a separate orchestration/application layer. Instead, the core Spark API (along with Streaming) would compete directly with Cascading for this kind of functionality, and the two would not likely be all that complementary. I realize that I am exposing my ignorance here and could be way off. Is there anyone who knows a bit about both of these technologies who could speak to this in broad strokes? Thanks! Philip
Re: set up spark in eclipse
Hi Arun, I had recent success getting a Spark project set up in Eclipse Juno. Here are the notes that I wrote down for the rest of my team that you may find useful: Spark version 0.8.0 requires Scala version 2.9.3. This is a bit inconvenient because Scala is now on version 2.10.3 and the latest and greatest versions of the Scala IDE plugin for Eclipse work with the latest versions of Scala. Getting a Scala plugin for Eclipse for version 2.9.3 requires a little bit of effort (but not too much). The following link has information about using Eclipse with Scala 2.9.3: http://scala-ide.org/download/current.html. Release 3.0.0 supports Scala 2.9.3 and there are two update sites (in green boxes). One is for Eclipse Juno: http://download.scala-ide.org/sdk/e38/scala29/stable/site. So, download Eclipse Juno SR2 64-bit for Windows from eclipse.org and add the Scala IDE using the above update site. If you can't get the update site to work, then you may need to download a zip file that is a downloadable update site from here: http://download.scala-ide.org/sdk/e38/scala29/stable/update-site.zip Philip

On 10/26/2013 11:20 PM, Arun Kumar wrote: Hi, I was trying to set up a spark project in eclipse. I used the sbt way to create a simple spark project as described in the documentation [1]. Then I used the sbteclipse plugin [2] to create the eclipse project. But I am getting errors when I import the project in eclipse. I seem to be getting them because the project is using Scala version 2.10.1 and there is no easy way to use Scala 2.9.2. Was anybody else more successful in setting this up? Are there any other IDEs I could use? Can maven be used to create the scala/spark project and import it into eclipse? 1. http://spark.incubator.apache.org/docs/latest/quick-start.html#a-standalone-app-in-scala 2. https://github.com/typesafehub/sbteclipse Thanks
Re: unable to serialize analytics pipeline
A simple workaround that seems to work (at least in localhost mode) is to mark my top-level pipeline object (inside my simple interface) as transient and add an initialize method. In the method that calls the pipeline and returns the results, I simply call the initialize method if needed (i.e. if the pipeline object is null). This seems reasonable to me. I will try it on an actual cluster next. Thanks, Philip

On 10/22/2013 11:50 AM, Philip Ogren wrote: I have a text analytics pipeline that performs a sequence of steps (e.g. tokenization, part-of-speech tagging, etc.) on a line of text. I have wrapped the whole pipeline up into a simple interface that allows me to call it from Scala as a POJO - i.e. I instantiate the pipeline, I pass it a string, and get back some objects. Now, I would like to do the same thing for items in a Spark RDD via a map transformation. Unfortunately, my pipeline is not serializable, so I get a NotSerializableException when I try this. I played around with Kryo just now to see if that could help, and I ended up with a missing no-arg constructor exception on a class I have no control over. It seems the Spark framework expects that I should be able to serialize my pipeline when I can't (or at least don't think I can at first glance). Is there a workaround for this scenario? I am imagining a few possible solutions that seem a bit dubious to me, so I thought I would ask for direction before wandering about. Perhaps a better understanding of serialization strategies might help me get the pipeline to serialize. Or perhaps there is a way to instantiate my pipeline on demand on the nodes through a factory call. Any advice is appreciated. Thanks, Philip
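A sketch of the transient-plus-initialize workaround described above; Pipeline, its annotate method, and textRDD are hypothetical stand-ins for the real analytics classes:

// stand-in for the real, non-serializable pipeline
class Pipeline { def annotate(text: String): List[String] = text.split(" ").toList }

class PipelineWrapper extends Serializable {
  @transient private var pipeline: Pipeline = null

  // lazily rebuild the non-serializable pipeline on whichever node runs the task
  private def initializeIfNeeded() { if (pipeline == null) pipeline = new Pipeline() }

  def annotate(text: String): List[String] = {
    initializeIfNeeded()
    pipeline.annotate(text)
  }
}

val wrapper = new PipelineWrapper() // serializes without the pipeline field
val annotated = textRDD.map(line => wrapper.annotate(line))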
[jira] [Commented] (DERBY-4921) Statement.executeUpdate(String sql, String[] columnNames) throws ERROR X0X0F.S exception with EmbeddedDriver
[ https://issues.apache.org/jira/browse/DERBY-4921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13570302#comment-13570302 ] Philip Ogren commented on DERBY-4921: - If you change the client driver to behave as the embedded driver as you suggest above, then it will not be possible for my example code to work with both the Derby driver and the PostgreSQL driver without adding a driver-specific conditional. Please note Knut's observation about PostgreSQL above. It would be much preferable to me to see the embedded driver behave like the client driver because that will be much less likely to break other people's code (including mine.)

Statement.executeUpdate(String sql, String[] columnNames) throws ERROR X0X0F.S exception with EmbeddedDriver -- Key: DERBY-4921 URL: https://issues.apache.org/jira/browse/DERBY-4921 Project: Derby Issue Type: Bug Components: JDBC Affects Versions: 10.6.2.1 Reporter: Jarek Przygódzki

Statement.executeUpdate(insertSql, int[] columnIndexes) and Statement.executeUpdate(insertSql, Statement.RETURN_GENERATED_KEYS) do work; Statement.executeUpdate(String sql, String[] columnNames) doesn't. Test program:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class GetGeneratedKeysTest {
  static String createTableSql = "CREATE TABLE tbl (id integer primary key generated always as identity, name varchar(200))";
  static String insertSql = "INSERT INTO tbl(name) values('value')";
  static String driver = "org.apache.derby.jdbc.EmbeddedDriver";
  static String[] idColName = { "id" };

  public static void main(String[] args) throws Exception {
    Class.forName(driver);
    Connection conn = DriverManager.getConnection("jdbc:derby:testDb;create=true");
    conn.setAutoCommit(false);
    Statement stmt = conn.createStatement();
    ResultSet rs;
    stmt.executeUpdate(createTableSql);
    stmt.executeUpdate(insertSql, idColName);
    rs = stmt.getGeneratedKeys();
    if (rs.next()) {
      int id = rs.getInt(1);
    }
    conn.commit();
  }
}

Result: Exception in thread main java.sql.SQLException: Table 'TBL' does not have an auto-generated column named 'id'. at org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(SQLExceptionFactory40.java:95) at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Util.java:256) at org.apache.derby.impl.jdbc.TransactionResourceImpl.wrapInSQLException(TransactionResourceImpl.java:391) at org.apache.derby.impl.jdbc.TransactionResourceImpl.handleException(TransactionResourceImpl.java:346) at org.apache.derby.impl.jdbc.EmbedConnection.handleException(EmbedConnection.java:2269) at org.apache.derby.impl.jdbc.ConnectionChild.handleException(ConnectionChild.java:81) at org.apache.derby.impl.jdbc.EmbedStatement.executeStatement(EmbedStatement.java:1321) at org.apache.derby.impl.jdbc.EmbedStatement.execute(EmbedStatement.java:625) at org.apache.derby.impl.jdbc.EmbedStatement.executeUpdate(EmbedStatement.java:246) at GetGeneratedKeysTest.main(GetGeneratedKeysTest.java:23) Caused by: java.sql.SQLException: Table 'TBL' does not have an auto-generated column named 'id'. at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(SQLExceptionFactory.java:45) at org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(SQLExceptionFactory40.java:119) at org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(SQLExceptionFactory40.java:70) ... 9 more Caused by: ERROR X0X0F: Table 'TBL' does not have an auto-generated column named 'id'. 
at org.apache.derby.iapi.error.StandardException.newException(StandardException.java:303) at org.apache.derby.impl.sql.execute.InsertResultSet.verifyAutoGeneratedColumnsNames(InsertResultSet.java:689) at org.apache.derby.impl.sql.execute.InsertResultSet.open(InsertResultSet.java:419) at org.apache.derby.impl.sql.GenericPreparedStatement.executeStmt(GenericPreparedStatement.java:436) at org.apache.derby.impl.sql.GenericPreparedStatement.execute(GenericPreparedStatement.java:317) at org.apache.derby.impl.jdbc.EmbedStatement.executeStatement(EmbedStatement.java:1232) ... 3 more -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see
[jira] [Commented] (DERBY-4921) Statement.executeUpdate(String sql, String[] columnNames) throws ERROR X0X0F.S exception with EmbeddedDriver
[ https://issues.apache.org/jira/browse/DERBY-4921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13569193#comment-13569193 ] Philip Ogren commented on DERBY-4921: - I would like to challenge the decision to close this issue. I was really confused about why I was seeing this behavior today, even after finding and reading through this issue, because I had never seen it before. I finally realized that it was because the behavior of the EmbeddedDriver and the ClientDriver differs on this exact point, and I had always used the client driver in the code in question. I think it is one thing to throw up your hands because the spec is vague (and even still, it seems obvious to me that this is a bug), but I think it is not really defensible to have the two drivers behave differently. Please reopen and fix!

Statement.executeUpdate(String sql, String[] columnNames) throws ERROR X0X0F.S exception with EmbeddedDriver -- Key: DERBY-4921 URL: https://issues.apache.org/jira/browse/DERBY-4921 Project: Derby Issue Type: Bug Components: JDBC Affects Versions: 10.6.2.1 Reporter: Jarek Przygódzki

Statement.executeUpdate(insertSql, int[] columnIndexes) and Statement.executeUpdate(insertSql, Statement.RETURN_GENERATED_KEYS) do work; Statement.executeUpdate(String sql, String[] columnNames) doesn't.
[jira] [Commented] (DERBY-4921) Statement.executeUpdate(String sql, String[] columnNames) throws ERROR X0X0F.S exception with EmbeddedDriver
[ https://issues.apache.org/jira/browse/DERBY-4921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13569197#comment-13569197 ] Philip Ogren commented on DERBY-4921: - I prepared some code that demonstrates the behavior, which I ran on the latest driver (10.9.1.0). I don't see a way to attach it so I will just paste it here. Hopefully it will render ok.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

/**
 * To run testClientDriver, first start the Derby server with something like:
 * java -jar derbyrun.jar server start
 *
 * To run testEmbeddedDriver, change the local variable 'location' to some directory that exists on your system.
 */
public class DerbyTest {

  public static final String CREATE_TABLE_SQL = "create table test_table (id int generated always as identity primary key, name varchar(100))";
  public static final String INSERT_SQL = "insert into test_table (name) values (?)";

  public static void testClientDriver() throws Exception {
    String driver = "org.apache.derby.jdbc.ClientDriver";
    Class.forName(driver).newInstance();
    Connection connection = DriverManager.getConnection("jdbc:derby://localhost:1527/testdb;create=true");
    connection.prepareStatement("drop table test_table").execute();
    connection.prepareStatement(CREATE_TABLE_SQL).execute();
    for (String columnName : new String[] { "ID", "id" }) {
      PreparedStatement statement = connection.prepareStatement(INSERT_SQL, new String[] { columnName });
      statement.setString(1, "my name");
      statement.executeUpdate();
      ResultSet results = statement.getGeneratedKeys();
      results.next();
      int id = results.getInt(1);
      System.out.println(id);
    }
  }

  public static void testEmbeddedDriver() throws Exception {
    String driver = "org.apache.derby.jdbc.EmbeddedDriver";
    Class.forName(driver).newInstance();
    String location = "D:\\temp";
    Connection connection = DriverManager.getConnection("jdbc:derby:" + location + "\\testdb;create=true");
    connection.prepareStatement("drop table test_table").execute();
    connection.prepareStatement(CREATE_TABLE_SQL).execute();
    for (String columnName : new String[] { "ID", "id" }) {
      PreparedStatement statement = connection.prepareStatement(INSERT_SQL, new String[] { columnName });
      statement.setString(1, "my name");
      statement.executeUpdate();
      ResultSet results = statement.getGeneratedKeys();
      results.next();
      int id = results.getInt(1);
      System.out.println("column name=" + columnName + ": " + id);
    }
  }

  public static void main(String[] args) throws Exception {
    testClientDriver();
    testEmbeddedDriver();
  }
}

Statement.executeUpdate(String sql, String[] columnNames) throws ERROR X0X0F.S exception with EmbeddedDriver -- Key: DERBY-4921 URL: https://issues.apache.org/jira/browse/DERBY-4921 Project: Derby Issue Type: Bug Components: JDBC Affects Versions: 10.6.2.1 Reporter: Jarek Przygódzki

Statement.executeUpdate(insertSql, int[] columnIndexes) and Statement.executeUpdate(insertSql, Statement.RETURN_GENERATED_KEYS) do work; Statement.executeUpdate(String sql, String[] columnNames) doesn't.
[jira] Commented: (UIMA-1983) JCasGen prouces source files with name shadowing/conflicts
[ https://issues.apache.org/jira/browse/UIMA-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12977607#action_12977607 ] Philip Ogren commented on UIMA-1983: - If I define a type called MyAnnotation in a type system descriptor file, then MyAnnotation.java and MyAnnotation_Type.java can be generated. The following can be compiler warnings in Java 1.5:

MyAnnotation.java:
- The field MyAnnotation.typeIndexID is hiding a field from type Annotation
- The field MyAnnotation.type is hiding a field from type Annotation
Both of these can be ignored with @SuppressWarnings("hiding").

MyAnnotation_Type.java:
- The field MyAnnotation_Type.featOkTst is hiding a field from type Annotation_Type
- The field MyAnnotation_Type.typeIndexID is hiding a field from type Annotation_Type
Both of these can be ignored with @SuppressWarnings("hiding").

When compiling with Java 1.6, I get the following additional warnings:

org.uimafit.type.MyAnnotation.readObject() - Empty block should be documented. This one can be fixed by simply adding a comment to the code block, such as /* generated code */.

org.uimafit.type.AnalyzedText.getTypeIndexID() - The method getTypeIndexID() of type AnalyzedText should be tagged with @Override since it actually overrides a superclass method. This one can be fixed by adding the @Override annotation.

org.uimafit.type.AnalyzedText_Type.getFSGenerator() - The method getFSGenerator() of type AnalyzedText_Type should be tagged with @Override since it actually overrides a superclass method. This one can be fixed by adding the @Override annotation.

It may be possible that there are other compiler warnings - but this is what I see with my current configuration, which is fairly standard, I think. Thanks.

JCasGen prouces source files with name shadowing/conflicts -- Key: UIMA-1983 URL: https://issues.apache.org/jira/browse/UIMA-1983 Project: UIMA Issue Type: Improvement Components: Core Java Framework Reporter: Philip Ogren Priority: Minor -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (UIMA-1983) JCasGen prouces source files with name shadowing/conflicts
JCasGen prouces source files with name shadowing/conflicts -- Key: UIMA-1983 URL: https://issues.apache.org/jira/browse/UIMA-1983 Project: UIMA Issue Type: Improvement Components: Core Java Framework Reporter: Philip Ogren Priority: Minor

When the compiler warnings are set to complain when name shadowing or name conflicts exist, the source files produced by JCasGen contain many warnings. It sure would be nice if these files came out pristine rather than having compiler warnings. Eclipse does not seem to allow for fine-grained compiler warning configuration (i.e. to ignore certain warnings for certain source folders or packages) but only works at the project level. Therefore, I must either turn these warnings off for the entire project or ignore the warnings in the type system java files. I'm guessing that this is a side effect of an intentional design decision (re e.g. typeIndexID) and so I am not that hopeful that this can be fixed, but I thought I would ask anyway. Thanks, Philip -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
CAS Editor as a stand-alone application?
I am wondering if it is possible to run the CAS Editor as a stand-alone application or if it is only available as a plugin within Eclipse. Thanks, Philip
[jira] Commented: (UIMA-1875) ability to visualize and quickly update/add values to primitive features
[ https://issues.apache.org/jira/browse/UIMA-1875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12936054#action_12936054 ] Philip Ogren commented on UIMA-1875: - This looks really promising - thanks for investigating! One very simple thing to try would be to center the tags under the annotations. Additionally, you could truncate the tags in cases where they still overlap. Maybe it would be possible to see the full tag when the mouse hovers over it. This otherwise looks really nice, btw. Even as it is, I would probably be willing to use it now for pos tagging.

ability to visualize and quickly update/add values to primitive features Key: UIMA-1875 URL: https://issues.apache.org/jira/browse/UIMA-1875 Project: UIMA Issue Type: New Feature Components: CasEditor Reporter: Philip Ogren Assignee: Jörn Kottmann Attachments: CasEditor-TagDrawingStrategy.tiff, CasEditor-TagDrawingStrategyOverlap.tiff

I spent a bit of time evaluating the CAS Editor recently and have the following suggestion. It is common to have annotation tasks in which adding a primitive value to an annotation feature happens frequently. Here's one common annotation task - part-of-speech tagging. Usually, the way this task is performed is that a part-of-speech tagger is run on some data and a part-of-speech tag is added as a string value to a feature of a token type. The annotator's task is then to look at the part-of-speech tags, make sure they look right, and fix the ones that aren't. However, the only way to see the part-of-speech tag is by clicking on the token annotation in the text and viewing the value of the feature in the editor view. This makes the tool really unusable for this annotation task. What would be really nice is to be able to display the part-of-speech tags above or below the tokens so that the linguist can scan the sentence with its tags and quickly find the errors. There are a number of other annotation tasks that have similar requirements. For example, named entities usually have category labels which would be nice to display. Word sense disambiguation data is also similar. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: Training and Learning
Borobudur, Do you mean training in the machine learning sense? If so, UIMA does not directly support any notion of training and using statistical classifiers. You might check out ClearTK http://cleartk.googlecode.com which is a UIMA-based project that provides support for a number of machine learning classifiers and supports feature extraction, training, and classification. Philip

On 11/12/2010 7:39 AM, borobudur wrote: Hi, I had a look at the UIMA architecture and I was asking myself if there is a concept for a training and learning phase in UIMA. What I learned is that UIMA chains Analysis Engines together. There is no two-phase concept like training and extraction in UIMA, is there? Thanks, borobudur
Re: AW: Compare two CASes
Hi Armin, I am glad you have found uimaFIT useful. I'm sorry about the lack of information about the dependency. It is documented in some sense in the pom.xml file - but that is not too helpful if you are not building with maven and you simply download the .jar file available from the downloads page. Thank you for bringing this to my attention. Philip

On 11/4/2010 11:35 PM, armin.weg...@bka.bund.de wrote: Hi Philip, thanks, that really helps. Looking into the uimaFIT source code showed that UIMA is missing some documentation that would be really useful. I think I will use uimaFIT from now on. uimaFIT should be part of UIMA itself. But you could have mentioned somewhere that uimaFIT depends on Apache Commons. Thanks, Armin

-----Original Message----- From: Philip Ogren Sent: Monday, November 1, 2010 16:20 To: user@uima.apache.org Subject: Re: Compare two CASes

Hi Armin, I have put together some example code using uimaFIT to address this very common scenario. There's a wiki page that provides an entry point here: http://code.google.com/p/uimafit/wiki/RunningExperiments Hope this helps. Philip

On 11/1/2010 2:09 AM, armin.weg...@bka.bund.de wrote: Hi, how do I compare two CASes? I would like to compute the f-value for some annotations of an actual CAS compared to a target CAS. To compare two CASes I need them both in one analysis engine, but an analysis engine processes only one CAS at a time. Do I have to merge the two CASes first to get a CAS with a target view and an actual view? How do I merge two CASes? Or is an analysis engine the wrong place to do this? Thanks, Armin
Re: Compare two CASes
Hi Armin, I have put together some example code using uimaFIT to address this very common scenario. There's a wiki page that provides an entry point here: http://code.google.com/p/uimafit/wiki/RunningExperiments Hope this helps. Philip

On 11/1/2010 2:09 AM, armin.weg...@bka.bund.de wrote: Hi, how do I compare two CASes? I would like to compute the f-value for some annotations of an actual CAS compared to a target CAS. To compare two CASes I need them both in one analysis engine, but an analysis engine processes only one CAS at a time. Do I have to merge the two CASes first to get a CAS with a target view and an actual view? How do I merge two CASes? Or is an analysis engine the wrong place to do this? Thanks, Armin
Re: maven src directory for generated code
Marshall, Thank you for this helpful hint. I was just now following up on Richard's tips and was trying to figure out how to get Eclipse to recognize the new source directory. I think I now have a working solution! I will send it around shortly. Philip

On 10/21/2010 9:03 AM, Marshall Schor wrote: Hi, I agree - lots of source generation tooling generates to target/generated-sources ... This has the advantage that the normal clean operation removes these. Using the m2eclipse Eclipse plugin for maven - by default it will miss these generated directories the first time you import a project as a Maven project. However, the recovery is simple and only needs doing once: right-click the project and select Maven > Update Project Configuration. -Marshall

On 10/21/2010 2:30 AM, Richard Matthias Eckart de Castilho wrote: Hello Philip, So, I would much rather have the generated code put into a different source folder that has only generated code in it. Something like src/generated/java or src/output/java or whatever the convention is. My question is whether or not there is a convention and if so could someone point me to a project that does something like this? My understanding is that target/generated-sources/toolname is appropriate - I've seen this in a couple of instances. This is also what is mentioned in the NetBeans wiki on Maven best practices: If your project contains generated source roots that need to appear in the project's source path, please make sure that the Maven plugin generating the sources generates them in the target/generated-sources/toolname directory, where toolname is a folder specific to the Maven plugin used and acts as a source root for the generated sources. Most common maven plugins currently follow this pattern in the default configuration. (Source: http://wiki.netbeans.org/MavenBestPractices#Open_existing_project) There is also a Maven plugin which allows you to add folders to the Maven source folders list. This may help if e.g. Eclipse does not properly pick up the folder: http://mojo.codehaus.org/build-helper-maven-plugin/usage.html Cheers, Richard
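For completeness, a sketch of the build-helper-maven-plugin configuration from the link above, registering a generated-sources folder as a source root (the jcasgen path and the omitted plugin version are illustrative):

<plugin>
  <groupId>org.codehaus.mojo</groupId>
  <artifactId>build-helper-maven-plugin</artifactId>
  <executions>
    <execution>
      <id>add-source</id>
      <phase>generate-sources</phase>
      <goals>
        <goal>add-source</goal>
      </goals>
      <configuration>
        <sources>
          <source>target/generated-sources/jcasgen</source>
        </sources>
      </configuration>
    </execution>
  </executions>
</plugin>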
maven src directory for generated code
We are using maven as the build platform for our uima-based projects. As part of the build, org.apache.uima.tools.jcasgen.Jg is invoked to generate java files for the type system. This is performed in the process-resources phase using a plugin configuration previously discussed on this list. We have the target directory for the generated java files set to src/main/java. However, we make it our practice not to check these files into our repository because they change every time they get generated - and because they are generated, there's no point in checking them in in the first place. As the type system gets bigger it gets more and more annoying to have src/main/java filled with directories that are under svn:ignore. This is particularly true when a sub-project imports a type system from a different project and the packaging structure of the generated source code is much different from the sub-project's code. So, I would much rather have the generated code put into a different source folder that has only generated code in it. Something like src/generated/java or src/output/java or whatever the convention is. My question is whether there is a convention and, if so, could someone point me to a project that does something like this? Thanks, Philip
uimaFIT 1.0.0 released
We are pleased to announce the release of uimaFIT 1.0.0 - a library that provides factories, injection, and testing utilities for UIMA. The following list highlights some of the features uimaFIT provides: * Factories: simplify instantiating UIMA components programmatically without descriptor files. For example, to instantiate an AnalysisEngine a call like this could be made: AnalysisEngineFactory.createPrimitive(MyAEImpl.class, myTypeSystem, "paramName", paramValue) * Injection: handles the binding of configuration parameter values to the corresponding member variables in the analysis engines and handles the binding of external resources. For example, to bind a configuration parameter just annotate a member variable with @ConfigurationParameter. Then add one line of code to your initialize method - ConfigurationParameterInitializer.initialize(this, uimaContext). This is handled automatically if you extend the uimaFIT JCasAnnotator_ImplBase class. * Testing: uimaFIT simplifies testing in a number of ways described in the documentation. By making it easy to instantiate your components without descriptor files, a large amount of difficult-to-maintain and unnecessary XML can be eliminated from your test code. This makes tests easier to write and maintain. Also, running components as a pipeline can be accomplished with a method call like this: SimplePipeline.runPipeline(reader, ae1, ..., aeN, consumer1, ..., consumerN) uimaFIT is licensed under the Apache Software License 2.0 and is available from Google Code at: http://uimafit.googlecode.com http://code.google.com/p/uimafit/wiki/Documentation uimaFIT is available via Maven Central. If you use Maven for your build environment, then you can add uimaFIT as a dependency to your pom.xml file with the following:

<dependency>
  <groupId>org.uimafit</groupId>
  <artifactId>uimafit</artifactId>
  <version>1.0.0</version>
</dependency>

uimaFIT is a collaborative effort between the Center for Computational Pharmacology at the University of Colorado Denver, the Center for Computational Language and Education Research at the University of Colorado at Boulder, and the Ubiquitous Knowledge Processing (UKP) Lab at the Technische Universität Darmstadt. uimaFIT is extensively used by projects being developed by these groups. The uimaFIT development team is: Philip Ogren, University of Colorado, USA Richard Eckart de Castilho, Technische Universität Darmstadt, Germany Steven Bethard, Stanford University, USA with contributions from Fabio Mancinelli, Chris Roeder, Philipp Wetzler, and Torsten Zesch. Please address questions to uimafit-us...@googlegroups.com. uimaFIT requires Java 1.5 or higher and UIMA 2.3.0 or higher.
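To make the injection feature concrete, here is a minimal sketch of an annotator written against the uimaFIT 1.x API described above. GreetingAnnotator and its parameter are made up for illustration, and the factory calls mirror the ones quoted in this announcement and in the uutuc posts further down - exact signatures may vary between versions:

import org.apache.uima.UIMAFramework;
import org.apache.uima.analysis_engine.AnalysisEngine;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.jcas.JCas;
import org.apache.uima.resource.metadata.TypeSystemDescription;
import org.uimafit.component.JCasAnnotator_ImplBase;
import org.uimafit.descriptor.ConfigurationParameter;
import org.uimafit.factory.AnalysisEngineFactory;

// a made-up annotator demonstrating parameter injection via @ConfigurationParameter
public class GreetingAnnotator extends JCasAnnotator_ImplBase {

  public static final String PARAM_GREETING = "greeting";

  @ConfigurationParameter(name = PARAM_GREETING, mandatory = true)
  private String greeting;

  @Override
  public void process(JCas jCas) throws AnalysisEngineProcessException {
    // the field was populated by uimaFIT before process() is called
    System.out.println(greeting + ": " + jCas.getDocumentText());
  }

  public static void main(String[] args) throws Exception {
    // this annotator declares no custom types, so an empty type system
    // description (created via the core UIMA factory) suffices here
    TypeSystemDescription tsd =
        UIMAFramework.getResourceSpecifierFactory().createTypeSystemDescription();
    AnalysisEngine ae = AnalysisEngineFactory.createPrimitive(
        GreetingAnnotator.class, tsd, PARAM_GREETING, "hello");
    JCas jCas = ae.newJCas();
    jCas.setDocumentText("some document text");
    ae.process(jCas);
  }
}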
why use UIMA logging facility?
I am involved in a discussion about how to do logging in UIMA. We were looking at section 1.2.2 in the tutorial for some motivation as to why we would use the built-in UIMA logging rather than just using e.g. log4j directly - but there doesn't seem to be any. Could someone give us some intuition as to why it matters which logging library we use? (I have my own suspicions - but that's all they are...) Thanks, Philip
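For anyone joining the discussion, the facility in question is the logger that UIMA hands each component through its context; a minimal sketch (the annotator and message are illustrative):

import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.jcas.JCas;
import org.apache.uima.util.Level;

public class LoggingAnnotator extends JCasAnnotator_ImplBase {
  @Override
  public void process(JCas jCas) throws AnalysisEngineProcessException {
    // the framework routes this through whatever logging implementation
    // the application configured, rather than binding the component
    // to a specific library such as log4j
    getContext().getLogger().log(Level.INFO,
        "processing a document of length " + jCas.getDocumentText().length());
  }
}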
what is the difference between the base CAS view and _InitialView?
I have a component that takes text from the _InitialView view, creates a second view, and posts a modified version of the text to the second view. I had a unit test that read a JCas in from an XMI file, ran the JCas through my annotator, and tested that the text was correctly posted to the second view. Everything was fine. Later, it occurred to me that I should add SofaCapabilities to my annotator, which I did. However, my test broke. The reason seems to be that the JCas I get back from the XMI file has the text in the _InitialView, but the jCas passed into my annotator's process method is no longer the _InitialView. I tracked this down in the debugger and discovered that this is due to the following lines of code in PrimitiveAnalysisEngine_impl: // Get the right view of the CAS. Sofa-aware components get the base CAS. // Sofa-unaware components get whatever is mapped to the _InitialView. CAS view = ((CASImpl) aCAS).getBaseCAS(); if (!mSofaAware) { view = aCAS.getView(CAS.NAME_DEFAULT_SOFA); } So, to make my test work again I simply injected the following line of code near the top of the process method of my annotator: jCas = jCas.getView(CAS.NAME_DEFAULT_SOFA); This made my problem go away. However, this seems quite unsatisfactory! Can someone explain to me what the difference is between the base CAS that I am now getting in my process method and the _InitialView CAS that I was getting before? Are my annotators supposed to be aware of this difference? Any advice and/or insight on this would be appreciated. Thanks, Philip
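For reference, the annotator pattern described above looks roughly like this, with the workaround line included; the view name and text transformation are illustrative:

import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.cas.CAS;
import org.apache.uima.cas.CASException;
import org.apache.uima.jcas.JCas;

public class ViewCreatingAnnotator extends JCasAnnotator_ImplBase {
  @Override
  public void process(JCas jCas) throws AnalysisEngineProcessException {
    try {
      // workaround discussed above: make sure we are on _InitialView,
      // not the base CAS handed to sofa-aware components
      jCas = jCas.getView(CAS.NAME_DEFAULT_SOFA);
      // create a second view holding a modified copy of the text
      JCas modifiedView = jCas.createView("modifiedText");
      modifiedView.setDocumentText(jCas.getDocumentText().toLowerCase());
    } catch (CASException e) {
      throw new AnalysisEngineProcessException(e);
    }
  }
}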
Re: Dynamic Annotators
Ram, You might check out the uutuc project at: http://code.google.com/p/uutuc/ The main goal of this project is to make it easier to dynamically describe and instantiate UIMA components. The project started off as utility classes for unit testing - but has really become a dynamic descriptor framework. We need to revamp the documentation to make this clearer. As a simple example you can instantiate an analysis engine with something like: TypeSystemDescription tsd = TypeSystemDescriptionFactory.createTypeSystemDescription(MyType.class, MyType2.class, ...); AnalysisEngineFactory.createPrimitive(MyComponent.class, tsd, "myConfigurationParameterName", configParamValue, ...); It's not a complete library but it has come a long way since we released it last spring. Cheers, Philip Ram Mohan Yaratapally wrote: Hi I am currently evaluating Apache UIMA for one of our requirements. Can somebody clarify whether it is possible to load the Annotators and Feature Properties dynamically? - Ram
Re: Problem reconfiguring after setting a config parameter value
Girish, I have done exactly the same thing as you minus step 2 below without any problems. The only caveat being that this didn't seem (as I recall) to trigger my analysis engine's initialize() method and so I had to reread the parameter in my analysis engine's process() method. I don't know how helpful this kind of feedback is. Philip Chavan, Girish wrote: Hi All, I am still stuck here. My backup option is to alter the descriptor xml before building an analysis engine with it, but it is kinda hacky, so I was hoping for a cleaner solution. Can any of the developers chime in? Thanks, ___ Girish Chavan, MSIS Department of Biomedical Informatics (DBMI) University of Pittsburgh -Original Message- From: Chavan, Girish [mailto:chav...@upmc.edu] Sent: Friday, July 24, 2009 11:01 AM To: uima-user@incubator.apache.org Subject: Problem reconfiguring after setting a config parameter value Hi All, I am using the following code to set the value for the 'ChunkCreaterClass' parameter. I had to add line (2) because without it I was getting a UIMA_IllegalStateException due to the session being null. I am not quite sure if that is the way you set a new session. In any case I stopped getting that error. But now I get a NullPointerException (trace shown below the code). I had a breakpoint set in the ConfigurationManagerImplBase.validateConfigurationParameterSettings method to see what was happening. It entered the method multiple times to validate each AE's params. The first set of calls was triggered after line (1). And it went through smoothly. The next set was triggered after line (4), which threw an exception immediately for the / context. Within the implementation of the validateConfigurationParameterSettings method, I see that ConfigurationParameterDeclarations comes up null for the / context. Not quite sure what I am doing wrong. Any ideas?? CODE: (1) AnalysisEngine ae = UIMAFramework.produceAnalysisEngine(specifier); (2) ae.getUimaContextAdmin().setSession(new Session_impl()); (3) ae.setConfigParameterValue("ChunkCreaterClass", "Something"); (4) ae.reconfigure(); Exception: java.lang.NullPointerException at org.apache.uima.resource.impl.ConfigurationManagerImplBase.validateConfigurationParameterSettings(ConfigurationManagerImplBase.java:488) at org.apache.uima.resource.impl.ConfigurationManagerImplBase.reconfigure(ConfigurationManagerImplBase.java:235) at org.apache.uima.resource.ConfigurableResource_ImplBase.reconfigure(ConfigurableResource_ImplBase.java:70) Thanks. ___ Girish Chavan, MSIS Department of Biomedical Informatics (DBMI) University of Pittsburgh UPMC Cancer Pavilion, 302D 5150 Centre Avenue Pittsburgh, PA 15232 Office: 412-623-4084 Email: chav...@upmc.edu
Re: Get the path
One thing that you might consider doing is putting the path information into its own view. That is, create a new view and set its document path to be the path/URI. One advantage of this is that if you have a CollectionReader that is otherwise type system agnostic, you don't have to pollute it with a single type for holding this information. This may not be the UIMA way - but we felt for this piece of information that this was a reasonable thing to do. The following class facilitates this: http://cleartk.googlecode.com/svn/trunk/doc/api/src-html/org/cleartk/util/ViewURIUtil.html Here is our type system agnostic file system collection reader which makes use of it: http://cleartk.googlecode.com/svn/trunk/doc/api/src-html/org/cleartk/util/FilesCollectionReader.html Hope this helps. Philip Adam Lally wrote: On Tue, Jul 21, 2009 at 4:25 AM, Radwen ANIBA arad...@gmail.com wrote: Hello everyone, Well, when playing a little bit with the JCas I was wondering how to get directly the path to the document treated within the AE without expressing it directly. What I want to do is to get the path and the document name e.g. /here/in/this/folder/Document.txt Is there any extension of the arg0.getDocumentText() method or something like that? This information isn't built into the framework, but there are some examples showing how to do it. There's a type called SourceDocumentInformation that is populated by the FileSystemCollectionReader and then used in the XMI Writer CAS Consumer (among others). -Adam
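A minimal sketch of the view-based approach, assuming an illustrative view name of UriView; ClearTK's ViewURIUtil linked above does essentially this:

import org.apache.uima.cas.CAS;

public class UriViewExample {
  private static final String URI_VIEW = "UriView"; // illustrative name

  // record the source document's location in its own view;
  // the mime type is left unspecified here
  public static void setURI(CAS cas, String uri) {
    CAS view = cas.createView(URI_VIEW);
    view.setSofaDataURI(uri, null);
  }

  // recover it later from any component that knows the view name
  public static String getURI(CAS cas) {
    return cas.getView(URI_VIEW).getSofaDataURI();
  }
}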
Re: model file problen in opennlp uima wrapper
It may be worth pointing out that there is a very nice set of UIMA wrappers for OpenNLP available from their SourceForge CVS repository. See http://opennlp.cvs.sourceforge.net/opennlp/. While this is still a work in progress - it is *much* nicer than the example wrappers that ship with UIMA. Burn Lewis wrote: That probably means the OpenNLP library was compiled with a newer version of Java. Try switching to a higher version number. (The message usually indicates the version number, e.g. 6 for Java 1.6) Burn.
uutuc - uima unit test utility code
We have posted a light-weight set of utility classes that ease the burden of unit testing UIMA components. The project is located at: http://uutuc.googlecode.com/ and is licensed under ASL 2.0. There is very little documentation for this library at the moment - just a bare-bones getting started tutorial that shows a very simple example of unit testing an analysis engine (RoomNumberAnnotator). A fuller tutorial (along with javadocs) is forthcoming. However, uutuc does use itself to unit-test its code (over 90% code coverage) and is now used by ClearTK, which has more than 250 tests in it. Both of these projects provide good examples of uutuc usage. It is by no means a complete library - but we have found it very useful for our project(s). Marshall, in response to your earlier email - if it makes sense to fold this into a UIMA project like uimaj-test-util, then I think that would be great. I thought it might be easier initially to clean up what we had written for ClearTK and release it as a separate project and see if anyone finds it useful.
proposal for UIMA unit test utility code
We have assembled some miscellaneous utility methods that make unit testing easier in support of our UIMA-based project ClearTK. I have come across several scenarios now where I wish this code were available as a separate project so that I don't have to create a dependency on our entire ClearTK project when I only need a few utility classes for writing unit tests. So, I am thinking about spinning off a separate project that just has this unit-test specific code in it. There really isn't that much code there (I think it's all shoved into one class at the moment) and it would really be just a seed for getting something started. One thing the code does is simplify initialization of UIMA components programmatically such that you don't have to use descriptor files in your unit tests. A unit test might start with something like: AnalysisEngine engine = TestsUtil.getAnalysisEngine(MyAnnotator.class, TestsUtil.getTypeSystem(Document.class, Token.class, Sentence.class)); JCas jCas = engine.newJCas(); and then you can go about populating the JCas and testing its contents. The current code is located here: http://code.google.com/p/cleartk/source/browse/trunk/test/src/org/cleartk/util/TestsUtil.java Some example unit tests based on this code can be found here: http://code.google.com/p/cleartk/source/browse/trunk/test/src/org/cleartk/example/ExamplePOSHandlerTests.java Is this something that is of interest? Is there already code in UIMA that helps with this sort of thing? Thanks, Philip
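For concreteness, a complete test built on such helpers might look like the following sketch; MyAnnotator, the type classes, and the expected token count are stand-ins carried over from the snippet above rather than real ClearTK code:

import junit.framework.TestCase;
import org.apache.uima.analysis_engine.AnalysisEngine;
import org.apache.uima.cas.FSIndex;
import org.apache.uima.jcas.JCas;

public class MyAnnotatorTest extends TestCase {
  public void testMyAnnotator() throws Exception {
    // instantiate the engine without any descriptor files
    AnalysisEngine engine = TestsUtil.getAnalysisEngine(MyAnnotator.class,
        TestsUtil.getTypeSystem(Document.class, Token.class, Sentence.class));
    JCas jCas = engine.newJCas();
    jCas.setDocumentText("UIMA makes testing easy.");
    engine.process(jCas);
    // count the Token annotations the annotator created;
    // the expected value is invented for illustration
    FSIndex tokenIndex = jCas.getAnnotationIndex(Token.type);
    assertEquals(4, tokenIndex.size());
  }
}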
updating payloads
Is it possible to update the payloads of an existing index? I am having trouble finding any mention of this in the mailing list archives, and it is not obvious from the API that this is possible. I do not want to change the size of the payloads - just update the values. My payload values depend on values returned from reader.docFreq(). I don't want to update often - just once - and I would settle for a very hackish way of doing it if needed. I suppose in the worst case scenario I could update the .cfs file directly - but now I expose my naivety. Any pointers are greatly appreciated. Thanks, Philip
abandoning Groovy
Marshall, This may be frustrating/annoying feedback. Last summer I sent a few emails to the list about unit testing UIMA components using Groovy and probably said a few other positive things about Groovy. We have since abandoned Groovy for a variety of reasons. Here are a few: - unit tests on code that does anything slightly useful with parameterized types did not work correctly on mac/unix - eclipse slows down because it seems user actions are constantly waiting for groovy to compile - refactoring code is not as seamless - it adds one more moving part to e.g. download/install instructions and makes distribution incrementally more complicated - the eclipse plugin is not keeping pace with versions of groovy - so we felt stuck with groovy 1.0. We really aren't Groovy experts and so there may be reasonable counterpoints to these annoyances that we encountered. But any perceived advantage of using Groovy was outweighed by our general annoyance with this language. I hope others find these comments useful. Philip Marshall Schor wrote: I just committed a very experimental version of the Open Calais annotator to the sandbox, called OpenCalaisGroovy. As you can probably guess from the title, it uses the programming language Groovy (google groovy) - which is an extension of Java. The details about trying this out are in the Readme.txt file. At this point, the annotator doesn't run on real results from Open Calais, but rather on test data that they make available for download - about 333 news stories. There is a special collection reader that reads the files in this set and sets up a CAS with the input text and a new FeatureStructure containing what OpenCalais would have responded with. A separate annotator takes the response as specified, and constructs Entities, Relations, and Instances, following the model that Open Calais is using. This implementation uses the JCas, and there is a complete set of JCas types defined for all the Entities and Relations that Open Calais produces. This is my first try at doing things in Groovy - so it may not be done in the ideal groovy way :-) - but it's interesting code to look at. The Readme tells how to run it - you'll need the Eclipse Groovy plugin (the maven build is not yet working due to bugs in the maven groovy builder, supposedly fixed in a SNAPSHOT release, but I'm waiting for the next release candidate - this is still early in the evolution cycle). Feedback / improvement welcomed and appreciated! -Marshall
Re: Type Priorities
Katrin, Yes. There is a penalty for iterating through all the annotations of a given type. Imagine you have a token annotation and a document with 10K tokens (not uncommon). We wrote a method that doesn't have this performance penalty and bypasses the type priorities. Please see: http://cslr.colorado.edu/ClearTK/index.cgi/chrome/site/api/src-html/edu/colorado/cleartk/util/AnnotationRetrieval.html#line.237 Philip Katrin Tomanek wrote: Hi Thilo, Actually, I am using the subiterator functionality and I need to have type priorities to be set. But I don't want the user of a component to be able to alter the type priorities by modifying the descriptors where type priorities typically are set. is that necessary? The user mustn't change the type system either, else the annotator will no longer work. Wouldn't you say it's part of the contract that such metadata is not changed by the user? well, yeah. Could see it that way. But if I know at the time of writing the component which types it will use and how the priorities should look, why should I risk letting the user violate that contract? But to come back to the true problem: I am asking about type priorities because I am using the subiterator. For me, type priorities are necessary when the types I am interested in have exactly the same offsets as the type with which I constrain my subiterator. So, the question for me is: shall I use the subiterator (and define priorities in the component's descriptor) or shall I write my own function (see below) doing what I want without the type priority problems: - SNIP // gives me a list of all abbreviations which are within my entity public ArrayList<Abbreviation> getAbbreviations(Entity entity, JFSIndexRepository index) { ArrayList<Abbreviation> abbrev = new ArrayList<Abbreviation>(); int startOffset = entity.getBegin(); int endOffset = entity.getEnd(); Iterator iter = index.getAnnotationIndex(Abbreviation.type).iterator(); while (iter.hasNext()) { Abbreviation currAbbrev = (Abbreviation) iter.next(); if (currAbbrev.getBegin() >= startOffset && currAbbrev.getEnd() <= endOffset) { abbrev.add(currAbbrev); } } return abbrev; } - SNIP Do you see an efficiency disadvantage when using the above function instead of the subiterator? Katrin
Re: How to test a CollectionReader (in Groovy)
I didn't follow the thread closely so I may be wandering here - but I thought I would volunteer my working strategy for testing collection readers in Groovy, even though it may be overly simplistic for many situations. My unit tests for our collection readers start off with one line: JCas jCas = TestsUtil.processCR("desc/test/myCRdesc.xml", 0) followed immediately by assertions of what I expect to be in the JCas. The method TestsUtil.processCR looks like this: static JCas processCR(String descriptorFileName, int documentNumber) { XMLInputSource xmlInput = new XMLInputSource(new File("desc/annotators/EmptyAnnotator.xml")) ResourceSpecifier specifier = UIMAFramework.getXMLParser().parseResourceSpecifier(xmlInput) AnalysisEngine analysisEngine = UIMAFramework.produceAnalysisEngine(specifier) JCas jCas = analysisEngine.newJCas() xmlInput = new XMLInputSource(new File(descriptorFileName)) specifier = UIMAFramework.getXMLParser().parseResourceSpecifier(xmlInput) CollectionReader collectionReader = UIMAFramework.produceCollectionReader(specifier) for(i in 0..documentNumber) { jCas.reset() collectionReader.getNext(jCas.getCas()) } return jCas } Where EmptyAnnotator.xml is a descriptor file for an analysis engine that does nothing, as follows: public class EmptyAnnotator extends JCasAnnotator_ImplBase{ public void process(JCas jCas) throws AnalysisEngineProcessException{ //this annotator does nothing! } } I hope this is helpful.
Re: ClassCastException thrown when using subiterator and moveTo()
btw, the work around I had posted is *much* slower than the fixed subiterator method - for both creating the iterator and iterating through it. More than an order of magnitude slower (roughly 15x slower). Thanks for the fix! Thilo Goetz wrote: public static FSIterator getWindowIterator(JCas jCas, Annotation windowAnnotation, Type type) { ConstraintFactory constraintFactory = jCas.getConstraintFactory(); FeaturePath beginFeaturePath = jCas.createFeaturePath(); beginFeaturePath.addFeature(type.getFeatureByBaseName("begin")); FSIntConstraint intConstraint = constraintFactory.createIntConstraint(); intConstraint.geq(windowAnnotation.getBegin()); FSMatchConstraint beginConstraint = constraintFactory.embedConstraint(beginFeaturePath, intConstraint); FeaturePath endFeaturePath = jCas.createFeaturePath(); endFeaturePath.addFeature(type.getFeatureByBaseName("end")); intConstraint = constraintFactory.createIntConstraint(); intConstraint.leq(windowAnnotation.getEnd()); FSMatchConstraint endConstraint = constraintFactory.embedConstraint(endFeaturePath, intConstraint); FSMatchConstraint windowConstraint = constraintFactory.and(beginConstraint, endConstraint); FSIndex windowIndex = jCas.getAnnotationIndex(type); FSIterator windowIterator = jCas.createFilteredIterator(windowIndex.iterator(), windowConstraint); return windowIterator; } Thilo Goetz wrote: That's a bug. The underlying implementation of the two iterator types you mention is totally different, hence you see this only in one of them. Any chance you could provide a self-contained test case that exhibits this? --Thilo Philip Ogren wrote: I am having difficulty with using the FSIterator returned by the AnnotationIndex.subiterator(AnnotationFS) method. The following is a code fragment: AnnotationIndex annotationIndex = jCas.getAnnotationIndex(tokenType); FSIterator tokenIterator = annotationIndex.subiterator(sentenceAnnotation); tokenIterator.moveTo(tokenAnnotation); Here is the relevant portion of the stack trace: java.lang.ClassCastException: edu.colorado.cslr.dessert.types.Token at java.util.Collections.indexedBinarySearch(Unknown Source) at java.util.Collections.binarySearch(Unknown Source) at org.apache.uima.cas.impl.Subiterator.moveTo(Subiterator.java:224) If I change the second line to the following, then I do not have any problems with an exception being thrown. FSIterator tokenIterator = annotationIndex.iterator(); Is this a bug or some misunderstanding on my part of how subiterator should work? Thanks, Philip
Unit testing with Groovy
Great! I have found that a great way to get started with Groovy is for unit testing. I borrowed a few lines of Java code for obtaining a JCas that I saw in an earlier post (from Marshall I think) that I used to create a little util class that has a single method: static JCas process(String descriptorFileName, String textFileName) { File textFile = new File(textFileName) String text if(textFile.exists()) text = textFile.text else text = textFileName XMLInputSource xmlInput = new XMLInputSource(new File(descriptorFileName)) ResourceSpecifier specifier = UIMAFramework.getXMLParser().parseResourceSpecifier(xmlInput) AnalysisEngine analysisEngine = UIMAFramework.produceAnalysisEngine(specifier) JCas jCas = analysisEngine.newJCas() jCas.setDocumentText(text) analysisEngine.process(jCas) return jCas } With this I can make a unit test very easily: void testTokenizer() { println "running token tests ..." JCas jCas = TestsUtil.process(tokenDescriptorFile, "data/test/docs/sampletext.txt") FSIndex tokenIndex = jCas.getAnnotationIndex(Token.type) assert tokenIndex.size() == 132 ... } I have also started exploring writing UIMA components with Groovy. I recently put together a file collection reader that uses Groovy's Ant builder, which allows one to define includes and excludes for a FileSet in a descriptor file the same way you would in a build.xml file. I haven't made it available yet because it needs a README and it is also dog slow on large corpora. However, it is 100 lines of code and provides nifty functionality for smaller data sets. My point is simply to give an example of what is possible. When I understand the CAS better, it might be cool to think about creating a Groovy builder that would be a more dynamic solution than JCas - but that is just a dream at the moment given my time and (lack of) expertise in these two technologies. Thilo Goetz wrote: Fixed for 2.2, thanks for reporting and providing the test case. This Groovy stuff looks pretty cool, I'll have to check it out some more... --Thilo Thilo Goetz wrote: The easiest place to create a test case is on Jira. Open a Jira bug for UIMA, and you can attach the test case. When you do that, please check the box that says something like, ok to include in Apache code (so we can check it in and use it as a regression test). Groovy, hm. Never used it before. If it doesn't take me more than 5 min to set up in Eclipse, and I can still debug, not a problem ;-) --Thilo Philip Ogren wrote: Yes. I will throw one together. Would you mind if I used Groovy for this - or is that going to be annoying? Let me know. Also, from an earlier email I saw on this list it seems that attachments are a problem. Is there some place where I could directly load a test case?
In the mean time, here is a work around I just put together (I'm still unit testing the code that uses this - so I'm not certain this is bug free): public static FSIterator getWindowIterator(JCas jCas, Annotation windowAnnotation, Type type) { ConstraintFactory constraintFactory = jCas.getConstraintFactory(); FeaturePath beginFeaturePath = jCas.createFeaturePath(); beginFeaturePath.addFeature(type.getFeatureByBaseName("begin")); FSIntConstraint intConstraint = constraintFactory.createIntConstraint(); intConstraint.geq(windowAnnotation.getBegin()); FSMatchConstraint beginConstraint = constraintFactory.embedConstraint(beginFeaturePath, intConstraint); FeaturePath endFeaturePath = jCas.createFeaturePath(); endFeaturePath.addFeature(type.getFeatureByBaseName("end")); intConstraint = constraintFactory.createIntConstraint(); intConstraint.leq(windowAnnotation.getEnd()); FSMatchConstraint endConstraint = constraintFactory.embedConstraint(endFeaturePath, intConstraint); FSMatchConstraint windowConstraint = constraintFactory.and(beginConstraint, endConstraint); FSIndex windowIndex = jCas.getAnnotationIndex(type); FSIterator windowIterator = jCas.createFilteredIterator(windowIndex.iterator(), windowConstraint); return windowIterator; } Thilo Goetz wrote: That's a bug. The underlying implementation of the two iterator types you mention is totally different, hence you see this only in one of them. Any chance you could provide a self-contained test case that exhibits this? --Thilo Philip Ogren wrote: I am having difficulty with using the FSIterator returned by the AnnotationIndex.subiterator(AnnotationFS) method. The following is a code fragment: AnnotationIndex annotationIndex = jCas.getAnnotationIndex(tokenType); FSIterator tokenIterator = annotationIndex.subiterator(sentenceAnnotation); tokenIterator.moveTo(tokenAnnotation); Here is the relevant portion of the stack trace: java.lang.ClassCastException: edu.colorado.cslr.dessert.types.Token at java.util.Collections.indexedBinarySearch(Unknown Source) at java.util.Collections.binarySearch(Unknown Source) at org.apache.uima.cas.impl.Subiterator.moveTo(Subiterator.java:224)
typeSystemInit() method in CasAnnotator_ImplBase but not JCasAnnotator_ImplBase
Thilo had pointed me towards the method typeSystemInit() in a recent posting as a way of getting type system information in an annotator. Is there a reason that this method exists in CasAnnotator_ImplBase but not JCasAnnotator_ImplBase? Or is this an omission? My intuition is that it might have been left out on purpose because when using the JCas you typically have a fixed type system that is determined at compile time. Still, it seems useful to be able to override typeSystemInit() even if it is only called once. I didn't really think carefully about which Annotator_ImplBase I should use. Are there far-reaching consequences I should consider - or is it really just about convenience at the API level? Thanks, Philip
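For readers who haven't used it, the CAS-based hook looks like this; a minimal sketch, assuming an illustrative type name org.example.Token:

import org.apache.uima.analysis_component.CasAnnotator_ImplBase;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.cas.CAS;
import org.apache.uima.cas.Type;
import org.apache.uima.cas.TypeSystem;

public class TokenCountAnnotator extends CasAnnotator_ImplBase {
  private Type tokenType;

  @Override
  public void typeSystemInit(TypeSystem typeSystem) throws AnalysisEngineProcessException {
    // called by the framework when the type system changes, before process();
    // resolve type handles once here instead of on every CAS
    tokenType = typeSystem.getType("org.example.Token");
  }

  @Override
  public void process(CAS cas) throws AnalysisEngineProcessException {
    int n = cas.getAnnotationIndex(tokenType).size();
    System.out.println("tokens: " + n);
  }
}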
[jira] Updated: (UIMA-464) ClassCastException thrown when using subiterator and moveTo()
[ https://issues.apache.org/jira/browse/UIMA-464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Philip Ogren updated UIMA-464: -- Attachment: UIMA-464.zip Please see README in the top level of the directory. If there is something I could have done to make this fit better into your regression test suite, then please let me know. ClassCastException thrown when using subiterator and moveTo() - Key: UIMA-464 URL: https://issues.apache.org/jira/browse/UIMA-464 Project: UIMA Issue Type: Bug Components: Core Java Framework Affects Versions: 2.1 Environment: N/A Reporter: Philip Ogren Attachments: UIMA-464.zip I am having difficulty with using the FSIterator returned by the AnnotationIndex.subiterator(AnnotationFS) method. The following is a code fragment: AnnotationIndex annotationIndex = jCas.getAnnotationIndex(tokenType); FSIterator tokenIterator = annotationIndex.subiterator(sentenceAnnotation); tokenIterator.moveTo(tokenAnnotation); Here is the relevant portion of the stack trace: java.lang.ClassCastException: edu.colorado.cslr.dessert.types.Token at java.util.Collections.indexedBinarySearch(Unknown Source) at java.util.Collections.binarySearch(Unknown Source) at org.apache.uima.cas.impl.Subiterator.moveTo(Subiterator.java:224) If I change the second line to the following, then I do not have any problems with an exception being thrown. FSIterator tokenIterator = annotationIndex.iterator(); -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: subtyping uima.cas.String
Attached is a type system descriptor file that isolates the bug. I cannot create a subtype of TestType in the CDE. Philip Ogren wrote: I was just putting some unit tests together and was editing a type system and noticed that I can't seem to subtype a type that is a subtype of uima.cas.String in the CDE. I thought this was maybe a bug that I should bring attention to. For my part, I'm not concerned with the ability to create nested subtypes of uima.cas.String - so apologies if this is noise.

<?xml version="1.0" encoding="UTF-8"?>
<typeSystemDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <name>TestTypeSystem</name>
  <description/>
  <version>1.0</version>
  <vendor/>
  <types>
    <typeDescription>
      <name>TestType</name>
      <description/>
      <supertypeName>uima.cas.String</supertypeName>
    </typeDescription>
    <typeDescription>
      <name>TestType2</name>
      <description/>
      <supertypeName>uima.tcas.Annotation</supertypeName>
    </typeDescription>
    <typeDescription>
      <name>TestType3</name>
      <description/>
      <supertypeName>TestType2</supertypeName>
    </typeDescription>
  </types>
</typeSystemDescription>
Re: subtyping uima.cas.String
Sorry for the noise! A little investigation reveals that this behavior is almost certainly by design. Changing the source by hand gives an error message that says don't do that, and section 2.3.4 of the UIMA References also documents this. Philip Ogren wrote: Attached is a type system descriptor file that isolates the bug. I cannot create a subtype of TestType in the CDE. Philip Ogren wrote: I was just putting some unit tests together and was editing a type system and noticed that I can't seem to subtype a type that is a subtype of uima.cas.String in the CDE. I thought this was maybe a bug that I should bring attention to. For my part, I'm not concerned with the ability to create nested subtypes of uima.cas.String - so apologies if this is noise.

<?xml version="1.0" encoding="UTF-8"?>
<typeSystemDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <name>TestTypeSystem</name>
  <description/>
  <version>1.0</version>
  <vendor/>
  <types>
    <typeDescription>
      <name>TestType</name>
      <description/>
      <supertypeName>uima.cas.String</supertypeName>
    </typeDescription>
    <typeDescription>
      <name>TestType2</name>
      <description/>
      <supertypeName>uima.tcas.Annotation</supertypeName>
    </typeDescription>
    <typeDescription>
      <name>TestType3</name>
      <description/>
      <supertypeName>TestType2</supertypeName>
    </typeDescription>
  </types>
</typeSystemDescription>
Re: Human annotation tool for UIMA
Here is a link to the svn repository which has a branch for the CasEditor. I have not yet found a documentation page for it, if there is one. http://svn.apache.org/repos/asf/incubator/uima/sandbox/trunk/ Andrew Borthwick wrote: As per the below, could someone please post a link to the CasEditor to uima-user so we don't all have to go hunting for it? Thanks, Andrew Borthwick On 6/7/07, Philip Ogren [EMAIL PROTECTED] wrote: Also note that we have a contribution by Joern Kottmann in the sandbox called CAS editor. This is Eclipse-based tooling that also allows you to manually create annotations. It is still under development, but maybe there could be some cross-fertilization. The CAS editor is very hard to find! You might consider putting a link on the sandbox page. I will take a look at this. If nothing else, Knowtator might serve as a source of requirements / feature requests for the CasEditor. Also, calculating IAA is non-trivial for a variety of reasons which deserve their own conversation. Do you have any notion what Joern's commitment is to the project?
problem creating Index Collection Descriptor File
I have three related questions that I decided to split up into three messages. I composed them as one email initially and decided I could be spawning a hard-to-traverse thread. Apologies in advance for the inundation. I am trying to create an Index Collection Descriptor File so that I can have my index definitions in one place and am unable to do so. A dialog comes up requesting a type system descriptor file as a context for the descriptor file. When I specify my type system descriptor file and click ok/finish, the component descriptor editor gives me an error message and a details button. When I click on the details button I get: java.lang.ClassCastException: java.io.File at org.apache.uima.taeconfigurator.editors.MultiPageEditor.loadContext(MultiPageEditor.java:1479) at org.apache.uima.taeconfigurator.editors.MultiPageEditor.setFsIndexCollection(MultiPageEditor.java:1505) ... at org.eclipse.core.launcher.Main.basicRun(Main.java:280) at org.eclipse.core.launcher.Main.run(Main.java:977) at org.eclipse.core.launcher.Main.main(Main.java:952) Am I doing something wrong or is this functionality broken? Thanks, Philip
finding annotations relative to other annotations
Is there any simple way to ask for the token 3 to the left of my current token? I can't find anything that is built into the default annotation index, and so I have defined an index for this in the descriptor file. In order to do this I define a feature in my token type that keeps track of the index of each token and create an FSIndex on that feature. I can then use the FSIndex as follows: FSIndex fsIndex = jCas.getFSIndexRepository().getIndex("my.token.place.index"); MyToken targetAnnotation = new MyToken(jCas); targetAnnotation.setIndex(targetIndex); // targetIndex is the index of the token I want to find return fsIndex.find(targetAnnotation); Does this make sense? It works - but I just want to make sure that I am not creating extra work for myself. I actually have several types that I want to be able to access this way and so I've created a super-type that has the 'index' feature. This allows me to create a set of utility methods that I can use for retrieving e.g. tokens, sentences, constituents, etc. at a specific index or relative to other annotations of the same type. The only challenge is making sure I retrieve (and manage) the appropriate index when one of these methods is called. Any advice here is appreciated. Thanks, Philip
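For anyone wanting to replicate this, the index definition in the descriptor would look something like the following; a sketch assuming an illustrative type my.TokenType with an integer feature named index:

<fsIndexCollection>
  <fsIndexes>
    <fsIndexDescription>
      <label>my.token.place.index</label>
      <typeName>my.TokenType</typeName>
      <kind>sorted</kind>
      <keys>
        <fsIndexKey>
          <!-- sort on the integer 'index' feature so find() can locate
               a token by its position -->
          <featureName>index</featureName>
          <comparator>standard</comparator>
        </fsIndexKey>
      </keys>
    </fsIndexDescription>
  </fsIndexes>
</fsIndexCollection>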
Re: Human annotation tool for UIMA
My initial thought was to have a CasConsumer that loads annotations directly into Knowtator programmatically, and a CasInitializer that goes the other way. What remains is to have a way to translate/synchronize the Type System in UIMA with the class hierarchy / annotation schema in Knowtator back and forth. Both of these tasks should be fairly straightforward. I would rather do it this way than muck with translating file formats. Making it possible to go from one to the other is one thing but making it fun and easy is another. Some considerations are: - Knowtator has extra book keeping information that it associates with each annotation such as the annotator, the creation date, and comments - Knowtator allows multiple spans for a single annotation - Knowtator allows you to annotate mentions of classes (e.g. person) or mentions of instances (George Washington) ... and there are always usability issues, etc. see other answers interspersed below. Thilo Goetz wrote: Hi Philip, I downloaded Knowtator and played around with it a bit. It looks pretty slick (although I think it could be improved by following some of the more standard UI conventions, like dialogs with ok and cancel buttons, for example; though I guess that's Protege, not Knowtator). Yes. Protege uses non-modal dialogs by design. There is a lengthy explanation of this on the Protege website: http://protege.stanford.edu/doc/design/ok_and_cancel_buttons.html Also note that we have a contribution by Joern Kottmann in the sandbox called CAS editor. This is Eclipse based tooling that also allows you to manually create annotations. It is still under development, but maybe there could be some cross-fertilization. The CAS editor is very hard to find! You might consider putting a link on the sandbox page. I will take a look at this. If nothing else, Knowtator might serve as a source of requirements / feature requests for the CasEditor. Also, calculating IAA is non-trivial for a variety of reasons which deserve their own conversation. Do you have any notion what Joern's commitment is to the project? [I know it's a bit early to talk about legal issues, but please note that the Mozilla license is incompatible with the Apache license. If any of your code were to move to Apache, it would need to be relicensed under ASL 2.0.] We chose MPL because that is what Protege uses. Re funding: we are optimistic that there will be more UIMA innovation awards by IBM this year. Watch this space for the announcement. I would think that a Knowtator/UIMA integration could be a candidate. Great! We will look forward to seeing the announcement. A human annotation tool that works well with UIMA would be an important addition to our ecosystem, so I'm glad this discussion is happening. BTW, are people familiar with the annotation standards work that is going on at ISO (http://www.tc37sc4.org/)? This bears watching as well, as it might evolve into the annotation standard that everybody's been waiting for. At least that seems to be the plan of the folks working on it ;-). --Thilo
Re: Human annotation tool for UIMA
I'm glad I happened to browse the archive today! I just joined the list today because I have noticed a couple of bugs that I want to post somewhere. So, I developed and maintain Knowtator and am also steeped in UIMA technology - I have been using it for just over a year and a half now. I would love to make Knowtator tightly integrated with UIMA. The frame-based representation that Knowtator uses via Protégé (Frames edition) is *very* similar to the type system framework in UIMA. Protégé frames is actually much more expressive, and I would be surprised if its representational capability does not completely subsume UIMA's type system representational capability. Knowtator cannot make use of some of the more advanced representational constructs such as slot inheritance (i.e. superslots and subslots, e.g. has-characteristic might subsume has-color) or multiple inheritance - but I think it handles anything that can be represented in a UIMA type system. I think it would be a really nice fit. A one-off solution as described previously is going to be extremely frustrating. Things that you might want to do with an annotation tool include: - calculate inter-annotator agreement in a wide variety of ways - consolidate/adjudicate disagreement between annotators - mundane data management tasks - keep track of who annotated what - merge sets of annotations together - stand-alone annotation on a laptop (a perk for annotators if it is a part-time job that they can do during hours of their own choosing) - work on / visualize subsets of annotations - there are a bajillion user-interface considerations I'm sure there are many other things I haven't thought of off the top of my head. We have been working really hard on Knowtator the last few months and have tried to make it much easier to get started with and more user friendly. If you have looked at it previously and got discouraged, then I encourage you to take a look at the latest version and the updated documentation. It is still a bit clunky with respect to importing and exporting annotation data (which we are currently working on to make easier) - but having a UIMA solution would make this problem go away for this community. We have created some one-off scripts to go from one to the other that we could possibly make available with a little effort if there is interest. If you have ideas about how this effort could be funded I would be grateful for suggestions. We are considering applying for an Eclipse Innovation Award as an appropriate venue but we don't really know what the odds are of getting it funded for this work. Or, if you have interest in working on this yourself, I would be thrilled to provide expertise. Thanks, Philip One manual annotation tool that is open source is Knowtator (which is licensed under MPL 1.1). As I understand it, Knowtator is intended for manual annotation of entities and relationships in text. It is a layer on top of the Protégé open source ontology editor. I'm not really familiar enough with Knowtator to explicitly recommend it. Considering its stated goals and the framework that it was developed on, it seems like it might be particularly well suited to enabling manual annotations for relatively elaborate type systems that have a lot of structure and many common relation annotation types. The flip side is that it may be overkill for the (more common) task of marking up instances of a flat list of named-entity types.
In any event, my point here is just that anyone who is thinking of building a mapping from an open source manual annotation tool to UIMA may want to consider Knowtator, especially if they are interested in a lot of expressive power.
creating a type called 'Feature'
If you create a type with the name 'Feature' you get compile errors in the generated code because of a namespace conflict with the Feature interface. I think this could be easily fixed by simply removing the import statement in the generated code and explicitly providing the fully qualified name for the Feature interface wherever it is referenced (e.g. 'org.apache.uima.cas.Feature'). Please advise me if I am submitting this to the wrong list. Thanks, Philip
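To illustrate the clash: the generated class is itself named Feature, so an import of org.apache.uima.cas.Feature collides with it. A minimal hand-written sketch of the fully-qualified fix suggested above (this is illustrative, not actual JCasGen output):

package org.example.types;

// note: no "import org.apache.uima.cas.Feature;" here - this class is
// itself named Feature, so the UIMA interface must be referenced by its
// fully qualified name instead
public class Feature extends org.apache.uima.jcas.tcas.Annotation {

  public Feature(org.apache.uima.jcas.JCas jcas) {
    super(jcas);
  }

  // illustrative use of the framework interface, fully qualified
  public String describe(org.apache.uima.cas.Feature uimaFeature) {
    return uimaFeature.getName();
  }
}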