Hi all,
this is a discussion which started in this JIRA issue:
https://issues.apache.org/jira/browse/OPENNLP-99
Steven proposed using Iterators instead of a stream-like interface.
The current status is that we have an EventStream inside maxent which
is modeled on an iterator, but does not implement the
java.util.Iterator interface.
In the tools project we came up with the ObjectStream, which is inspired
by the InputStream class but deals with objects instead of bytes. It has
the following methods: read, reset and close.
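To make the shape of such an interface concrete, here is a minimal
sketch. It is not the actual code from the tools project: the generic
parameter, the IOException declarations, the convention that read
returns null at the end of the stream, and the ListObjectStream helper
are all assumptions made for illustration.

```java
import java.io.IOException;
import java.util.List;

// Hypothetical sketch of an ObjectStream-like interface (not the real one).
interface ObjectStream<T> {
    // Returns the next object, or null when the stream is exhausted
    // (assumed convention).
    T read() throws IOException;

    // Repositions the stream at its first object.
    void reset() throws IOException;

    // Releases any underlying resources.
    void close() throws IOException;
}

// In-memory implementation, used only to show how the methods fit together.
class ListObjectStream<T> implements ObjectStream<T> {
    private final List<T> items;
    private int position = 0;

    ListObjectStream(List<T> items) {
        this.items = items;
    }

    @Override
    public T read() {
        return position < items.size() ? items.get(position++) : null;
    }

    @Override
    public void reset() {
        position = 0;
    }

    @Override
    public void close() {
        // Nothing to release for an in-memory list.
    }
}
```

An implementation backed by a file would keep the IOException
declarations, so callers going through the interface are forced to
handle them.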
The current plan is to also use an ObjectStream-like interface as a
replacement for EventStream, but we never got this finished.
OK, in my opinion we cannot implement the java.util.Iterator interface,
because java.util.Iterator does not allow us to implement the error
handling with checked exceptions nicely. I personally also believe that
java.util.Iterator communicates that it can just be used without
worrying about any severe issues like I/O errors.
In order to use such an Iterator with a for-each statement, the only
option we have is to throw unchecked exceptions, which I believe is
uncommon and unexpected to most people who read the code. The javadoc
would of course document that, but it would be easy to forget about.
Checked exceptions cannot be ignored, because the compiler forces the
programmer to handle them.
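As an illustration of the problem, here is a rough sketch of what such
an Iterator could look like. LineIterator and LineIteratorDemo are
hypothetical names, and wrapping the error in
java.io.UncheckedIOException is just one possible choice. Note that
countLines compiles without any error handling at all.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical iterator over lines; I/O failures surface as unchecked exceptions.
class LineIterator implements Iterator<String> {
    private final BufferedReader in;
    private String next;
    private boolean fetched = false;

    LineIterator(Reader reader) {
        this.in = new BufferedReader(reader);
    }

    @Override
    public boolean hasNext() {
        if (!fetched) {
            try {
                next = in.readLine();
            } catch (IOException e) {
                // Iterator.hasNext() declares no checked exceptions, so wrap it.
                throw new UncheckedIOException(e);
            }
            fetched = true;
        }
        return next != null;
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        fetched = false;
        return next;
    }
}

class LineIteratorDemo {
    // No try/catch is required here, although every iteration step may
    // fail with an I/O error, and nothing closes the reader.
    static int countLines(Reader reader) {
        Iterable<String> lines = () -> new LineIterator(reader);
        int count = 0;
        for (String line : lines) {
            count++;
        }
        return count;
    }
}
```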
It is simply a fact that the data we read for training must come from
somewhere. That somewhere is usually the disk or some other storage
system. Depending on the source (if it's not memory) the user has to
deal with certain errors and also needs to free resources again.
In Java that means that the data is usually retrieved via Readers or
InputStreams, both of which should usually only be used with a
try-catch-finally statement to ensure that in case of an error the
underlying resources can be released. Using an Iterator with unchecked
exceptions would mean somehow hiding that from the user, while using
checked exceptions forces the user to deal with it, hopefully correctly.
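For comparison, a stream-style version of the same kind of reading
could look roughly like the following. LineStream and LineStreamDemo
are hypothetical names and the null-at-end convention is an
assumption. The point is only that the compiler makes the caller
decide what to do with the IOException, and the finally block releases
the reader either way.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;

// Hypothetical stream of lines; read() and close() declare IOException.
class LineStream {
    private final BufferedReader in;

    LineStream(Reader reader) {
        this.in = new BufferedReader(reader);
    }

    // Returns the next line, or null at the end of the input (assumed convention).
    String read() throws IOException {
        return in.readLine();
    }

    void close() throws IOException {
        in.close();
    }
}

class LineStreamDemo {
    // The caller must either catch IOException or declare it.
    static int countLines(Reader reader) throws IOException {
        LineStream lines = new LineStream(reader);
        try {
            int count = 0;
            while (lines.read() != null) {
                count++;
            }
            return count;
        } finally {
            // Runs on success and on failure, so the reader is always released.
            lines.close();
        }
    }
}
```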
And there are more good reasons why our ObjectStream isn't bad at all:
it can easily be implemented and used in a thread-safe way, which is
harder for an iterator-like interface, because the pair of calls to
hasNext and next is of course not atomic, but a single call to read can
be.
A composed stream could look like this:
1. PlainTextByLineStream
2. LineParsingStream (creates a sample object out of the string line)
3. FeatureGenerationStream
4. Multi-threaded data indexer
The data indexer wants to call the read method of the composed stream
from multiple threads to pull in the training Events faster.
To make this thread safe the PlainTextByLineStream.read method would be
synchronized. LineParsingStream.read is safe when it only calls the
underlying read and does everything else on its stack. Same story for
the feature generation stream.
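A reduced sketch of that composition, with only a line source and a
parsing stage, might look like this. All class names here are
hypothetical stand-ins and not the real tools classes, the one-method
ObjectStream is trimmed down for brevity, and the "indexer" just sums
integers so that it is easy to check that every line is consumed
exactly once.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Trimmed-down, hypothetical stream interface (reset and close omitted).
interface ObjectStream<T> {
    T read() throws IOException; // null signals the end (assumed convention)
}

// Source stage: the only shared mutable state, so read() is synchronized.
class SynchronizedLineStream implements ObjectStream<String> {
    private final BufferedReader in;

    SynchronizedLineStream(BufferedReader in) {
        this.in = in;
    }

    @Override
    public synchronized String read() throws IOException {
        return in.readLine();
    }
}

// Parsing stage: calls the underlying read once and keeps everything
// else in local variables, so it needs no locking of its own.
class IntParsingStream implements ObjectStream<Integer> {
    private final ObjectStream<String> lines;

    IntParsingStream(ObjectStream<String> lines) {
        this.lines = lines;
    }

    @Override
    public Integer read() throws IOException {
        String line = lines.read();
        return line == null ? null : Integer.valueOf(line.trim());
    }
}

class IndexerDemo {
    // Several workers pull from the same composed stream concurrently.
    static int sumAll(String text, int threadCount)
            throws InterruptedException, ExecutionException {
        ObjectStream<Integer> samples = new IntParsingStream(
                new SynchronizedLineStream(new BufferedReader(new StringReader(text))));
        ExecutorService pool = Executors.newFixedThreadPool(threadCount);
        try {
            List<Future<Integer>> partials = new ArrayList<>();
            for (int i = 0; i < threadCount; i++) {
                // A Callable may throw, so the IOException from read() is
                // reported through Future.get() instead of being swallowed.
                partials.add(pool.submit(() -> {
                    int sum = 0;
                    Integer sample;
                    while ((sample = samples.read()) != null) {
                        sum += sample;
                    }
                    return sum;
                }));
            }
            int total = 0;
            for (Future<Integer> partial : partials) {
                total += partial.get();
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }
}
```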
When you want to do something like this with an iterator-style
interface it is harder to get it thread safe, because the state can
change between the call to hasNext and the call to next, which would
mean that more locking must be used.
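To show what that extra locking means in practice, here is a small
sketch. SharedIteratorDemo and pollNext are hypothetical names. Every
caller has to go through a helper like this (or take the same lock
itself), because synchronizing hasNext and next individually would
not be enough. It also assumes that the iterator never yields null
elements, since null is used here to signal the end.

```java
import java.util.Iterator;

class SharedIteratorDemo {
    // Returns the next element, or null when the iterator is exhausted.
    // The lock has to span both calls; otherwise another thread could
    // take the last element between hasNext() and next().
    static <T> T pollNext(Iterator<T> shared) {
        synchronized (shared) {
            return shared.hasNext() ? shared.next() : null;
        }
    }
}
```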
In the end I simply think that Iterators are good if you do not have to
deal with errors and underlying OS resources, and streams are the Java
way when you sadly have to take all this into account.
Using an Iterator for all this just to be able to use a for-each sounds
to me like a design which is made to be abused to circumvent important
error handling.
Jörn