I agree with all your comments and I would keep the getContentTypes()
method in the TextExtractor interface. I think that way it requires less
configuration because a TextExtractor knows the mime types it can
handle anyway.
How should we handle backward compatibility? Do you think we should keep
the existing TextFilter interface?
regards
marcel
Jukka Zitting wrote:
Hi,
While resolving JCR-470, I started thinking about the design of the
TextFilter interface. There are a number of issues with the interface
that I believe could do with some refactoring:
1) The name of the interface. The name suggests that it's a filter for
an existing text stream. A better name would probably be TextExtractor
or something similar.
2) The canFilter() method. The design forces Jackrabbit to always
iterate through the whole list of configured index filters asking each
whether it can filter a given content type. Wouldn't it be better to
have a getContentTypes() method that would return the supported
content types, so Jackrabbit could just do a map lookup to get to the
correct filter. Alternatively we could completely do away with the
method and use a separate configuration file that maps MIME types to
implementation classes (or even configured instances).
3) The doFilter() method. See below for a breakdown:
3.1) Name of the method. As discussed in point 1 above, extractText()
could be a better name.
3.2) Return value. Why do we return a Map from the method? I see that
it would be possible to return a set of fields for indexing, but none
of the existing filter classes do this and in any case such a feature
feels a bit confusing (what if there's a field name conflict). The
design also forces the implementations to know the FieldNames.FULLTEXT
constant, a small extra dependency. Would it make sense to return just
a Reader instance?
3.3) The PropertyState argument. All of the implementation classes
just extract the binary stream from the first value of the givne
PropertyState instance. I don't see any benefit of passing the entire
state, it just adds complexity and an extra dependency. How about
passing just the InputStream? It would probably also make sense to
pass the actual MIME type along with the possible character encoding.
Put together, my proposal for a replacement interface would be:
public interface TextExtractor {
String[] getContentTypes(); // if any
Reader extractText(InputStream stream, String type, String
encoding);
}
Comments?
BR,
Jukka Zitting