Hi Manoj,

The format has been around for a long time. While I don’t think it predates XML, XML was probably not as ubiquitous then as it is today. However, it really should not be a stumbling block for you. I believe all you need to do is read in the data and get the spans of the names.

One other point: OpenNLP has the concept of a dictionary. Have you looked into opennlp.tools.dictionary.Dictionary and opennlp.tools.dictionary.DictionarySerializer? It looks like you want to create a DictionarySerializer that can read your format.
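
To make the "read in the data and get the spans" step concrete, here is a rough sketch in plain Java. It assumes one whitespace-tokenized sentence per line, with names marked as <START:type> ... <END> and no nesting. NameSpanSketch, NameSpan, Parsed, and parse are hypothetical names made up for illustration; none of them is OpenNLP API, and OpenNLP's own reader may behave differently in edge cases.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of OpenNLP: parses one line of
// <START:type> ... <END> training data into tokens and name spans.
class NameSpanSketch {

    // A name covering tokens [start, end), with end exclusive.
    static final class NameSpan {
        final int start;
        final int end;
        final String type;

        NameSpan(int start, int end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }

        @Override
        public String toString() {
            return "[" + start + ".." + end + ") " + type;
        }
    }

    // Result of parsing one line: the plain tokens plus the spans over them.
    static final class Parsed {
        final List<String> tokens = new ArrayList<>();
        final List<NameSpan> spans = new ArrayList<>();
    }

    static Parsed parse(String line) {
        Parsed result = new Parsed();
        int openStart = -1;      // token index where the open name began
        String openType = null;  // type of the open name, null if none is open

        for (String piece : line.trim().split("\\s+")) {
            if (piece.isEmpty()) {
                continue; // only happens for a blank line
            }
            boolean isStart = piece.equals("<START>")
                    || (piece.startsWith("<START:") && piece.endsWith(">"));
            if (isStart) {
                if (openType != null) {
                    throw new IllegalArgumentException("nested <START> in: " + line);
                }
                openStart = result.tokens.size();
                // "<START:person>" -> "person"; a bare "<START>" gets "default"
                // (the fallback label is an assumption of this sketch).
                int colon = piece.indexOf(':');
                openType = colon < 0
                        ? "default"
                        : piece.substring(colon + 1, piece.length() - 1);
            } else if (piece.equals("<END>")) {
                if (openType == null) {
                    throw new IllegalArgumentException("<END> without <START> in: " + line);
                }
                result.spans.add(new NameSpan(openStart, result.tokens.size(), openType));
                openType = null;
            } else {
                result.tokens.add(piece); // ordinary token, markers are dropped
            }
        }
        if (openType != null) {
            throw new IllegalArgumentException("missing <END> in: " + line);
        }
        return result;
    }

    public static void main(String[] args) {
        Parsed p = parse("<START:person> Pierre Vinken <END> is a good example .");
        System.out.println(p.tokens);
        System.out.println(p.spans);
    }
}
```

For the example sentence from this thread, the markers are stripped from the token list and a single span of type person covers tokens 0 and 1 (Pierre, Vinken). The spans index into the token list rather than into the raw string, so they do not shift when the markers are removed.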

One last point: this question is probably better asked on the user mailing list. Most of the developers are subscribed to both the user and dev lists.

Hope it helps,
Daniel

> On Jul 21, 2017, at 6:54 AM, Manoj B. Narayanan <manojb.narayanan2...@gmail.com> wrote:
>
> Hi Jim,
> Thanks for replying. Could you be more specific please.
>
> These are the things that I am aware of:
> 1. The training data can be of the form <START:person> Pierre Vinken <END>
> is a good example .
> 2. Currently I use a file in the below format and create a 'Dictionary'
> from it.
> This is the format
>
> <entry><token>vinayak</token></entry>
>>
>> <entry><token>rakesh</token></entry>
>>
>> <entry><token>sandeep</token></entry>
>>
>> <entry><token>manoj</token></entry>
>>
>>
> And use this dictionary in the DictionaryNameFinder.
>
> I would like to know the advantages of using this format. Is there any
> other formats available?
>
> Could you please explain more.
>
> Thanks.
> Manoj
>
> On Fri, Jul 21, 2017 at 3:56 PM, Jim O'Regan <jaore...@tcd.ie> wrote:
>
>> 2017-07-19 10:48 GMT+01:00 Manoj B. Narayanan <
>> manojb.narayanan2...@gmail.com>:
>>
>>> Hi all,
>>>
>>> I wanted to find out if there is any specific reason behind using XML
>>> format for dictionaries for Name Finder.
>>>
>>
>> It's not XML. There is a very superficial similarity in the use of <>, but,
>> at a minimum
>> <START:person> Pierre Vinken <END>
>> would need to be something like
>> <name type="person"> Pierre Vinken </name>
>> and the whole document would need to be enclosed by a pair of tags.
>>
>>
>>> Also, is there any source from where we can get the documentation regarding
>>> the dictionary formats for various tools (tokenizer, pos, name finder).
>>>
>>
>> The manual: https://opennlp.apache.org/docs/1.8.1/manual/opennlp.html
>> More specifically,
>> tokeniser:
>> https://opennlp.apache.org/docs/1.8.1/manual/opennlp.html#tools.tokenizer.training
>> pos:
>> https://opennlp.apache.org/docs/1.8.1/manual/opennlp.html#tools.postagger.training
>> name finder:
>> https://opennlp.apache.org/docs/1.8.1/manual/opennlp.html#tools.namefind.training
>>