Hi Thorsten,

Thanks. It's great that you spotted the same issue; I had exactly the same thought about the API.

public interface Parser

I'm wondering what the purpose of Parser is, and what exactly the difference is between a Parser and a Handler, given that we could (and probably need to) parse the content entity input stream in the handler. A JIRA issue (https://issues.apache.org/jira/browse/DROIDS-11) actually mentions separating the outlink parsing into an Extractor.

afaik:

- By default, the parser does nothing more than extract the outlinks with NekoHTML's SAX parser. The SAX parser is event-based and does not store any parsed results, so any handler that needs to access the content has to parse it again.

- I use the Jericho HTML Parser, which parses the document and then stores some parsed data. So in the Droids model, I expect I should implement my parser with Jericho HTML and store the parsed data. When there are multiple handlers, all of them can share the same parsed results. In fact, if I have only one handler, it makes no difference whether I do my parsing and handling in the handler. And as I have implemented my own parsing anyway, the original outlink extraction can be skipped and there won't be duplicated parsing.

- For the original case, I wonder if the NekoHTML SAX parser should be stored in the parse(d) data without the link extraction content handler. The handler would still need to call "parse()" again, but it would not need to construct a NekoHTML SAX parser. If anyone uses a DOM parser, the original SAX parsing logic should certainly be skipped and the DOM tree could be stored for the handler.

- There are some minor comments on the API as follows:
  - It's good to merge Parse and ParseData. The meaning of "Parse" isn't too clear; ParseData is more meaningful, and ParsedData or ParseResult is clearer to me.
  - I suggest writing a few lines in the class comments about the design purpose of these classes.
  - If the Parse/ParseData also stores a reference to the Parser, a SAX parser could be re-used by the handler.
    (however, for a DOM parser this is confusing, as it should store the parsed data only)
  - It seems to me Parser.getParse should be Parser.parse(), as it triggers an action rather than getting the parse definition (or Parse.getData() -> Parse.parse()).
  - Re. Object getParseObject();, I suggest calling it Object getData(); instead.

btw, my understanding of Droids largely comes from the SimpleRuntime usage. I hope I didn't miss the big picture.

regards,
mingfai

On Fri, Mar 27, 2009 at 11:17 PM, Thorsten Scherler <[email protected]> wrote:

> On Thu, 2009-03-26 at 22:40 +0800, Mingfai wrote:
> > hi,
> >
> > Thanks for creating this very useful project.
>
> :)
>
> Thanks for this nice feedback.
>
> > I'm new to the droids, and have just learnt most of the concepts and am able to
> > write custom parser, filter, handler etc. And I have encountered a use case
> > where I want to parse and store some custom data in the Parse/ParseData, and
> > have the custom data available in the handler.
>
> We actually discussed this before but I am not sure whether it was here
> or still on the labs list. Bottom line is that we do need to rethink the API
> around that. To begin with the API has an import to an implementation
> class (ParseData) which is just a bad idea. Further, like you pointed out,
> it may make sense to allow Object to allow custom objects.
>
> > Take a hypothetical example: assume I have a crawler that runs on Google's
> > search results; the parser parses a result page and extracts 10 links
> > together with the 10 cache links. In the Droids framework, there is no way
> > to pass the cache links to the handler, right?
>
> Actually since they are links and if they are not excluded in the
> regex-urlfilter.txt they would enter as "normal" link/task
>
> > As a workaround, i could just
> > use a singleton to store a map of data using the uri as the key, but it
> > seems to me it is better if the ParseData could store more than the outlinks
> > but also some custom data that we use. What do you think?
>
> How about
> public interface Parse {
>   Object getObject();
>   ParseData getParseData();
> }
>
> would merge with ParseData like
> public interface Parse {
>   Object getParseObject();
>   Collection<Link> getOutlinks();
> }
>
> This way we can reduce the level of depth in the API and make the
> relation clearer. We may even think about merging Parser and Parse too.
>
> WDYT?
>
> salu2
>
> > The implementation could be very simple, just store a Map
> >
> > Regards,
> > mingfai
> --
> Thorsten Scherler <thorsten.at.apache.org>
> Open Source <consulting, training and solutions>
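
To make the merged interface from the quoted proposal concrete, here is a minimal sketch applied to the search-result example. Every name in it (ParsedDataSketch, ParsedData, SimpleParsedData, parseSearchResult, cacheLinkFor) is hypothetical and not part of the Droids API, and plain String stands in for the Droids Link type so the sketch compiles on its own:

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: every name here is hypothetical, not the actual Droids API.
final class ParsedDataSketch {

    /** Merged stand-in for Parse/ParseData: outlinks plus one parser-specific object. */
    interface ParsedData {
        Collection<String> getOutlinks(); // String stands in for the Droids Link type
        Object getData();                 // e.g. a DOM tree, a Jericho result, or a Map
    }

    /** Immutable holder: the parser fills it once, every handler reads the same instance. */
    static final class SimpleParsedData implements ParsedData {
        private final Collection<String> outlinks;
        private final Object data;

        SimpleParsedData(Collection<String> outlinks, Object data) {
            this.outlinks = Collections.unmodifiableList(new ArrayList<>(outlinks));
            this.data = data;
        }

        @Override public Collection<String> getOutlinks() { return outlinks; }
        @Override public Object getData() { return data; }
    }

    /** Parser side: result links become outlinks; the result-to-cache-link map rides along as custom data. */
    static ParsedData parseSearchResult(Map<String, String> cacheLinkByResult) {
        Map<String, String> copy =
                Collections.unmodifiableMap(new LinkedHashMap<>(cacheLinkByResult));
        return new SimpleParsedData(copy.keySet(), copy);
    }

    /** Handler side: returns the cache link for a result, or null if none is stored. */
    static String cacheLinkFor(ParsedData parsed, String resultUri) {
        Object data = parsed.getData();
        if (data instanceof Map<?, ?>) {
            Object value = ((Map<?, ?>) data).get(resultUri);
            return value instanceof String ? (String) value : null;
        }
        return null;
    }

    public static void main(String[] args) {
        Map<String, String> page = new LinkedHashMap<>();
        page.put("http://example.org/a", "http://cache.example.org/a");
        page.put("http://example.org/b", "http://cache.example.org/b");

        ParsedData parsed = parseSearchResult(page);
        for (String link : parsed.getOutlinks()) {
            System.out.println(link + " -> " + cacheLinkFor(parsed, link));
        }
    }
}
```

The handler has to check and cast the Object it gets back, which is the price of keeping the interface this small; a generic type parameter on ParsedData would be one alternative, at the cost of a wider API change.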
