Jeff Zemerick created OPENNLP-1363:
--------------------------------------
Summary: Verify the documentation of the lemmatizer input format
Key: OPENNLP-1363
URL: https://issues.apache.org/jira/browse/OPENNLP-1363
Project: OpenNLP
Issue Type: Task
Components: Documentation
Reporter: Jeff Zemerick
OPENNLP-1257 proposed changing the code to split the lemmatizer input on
spaces instead of tabs. I believe tab is the intended delimiter, but we need
to make sure the documentation is consistent with that.
Refer to
https://opennlp.apache.org/docs/1.9.4/manual/opennlp.html#tools.lemmatizer
and in particular to the following sentences:
"The training data consist of three columns separated by spaces. Each word has
been put on a separate line and there is an empty line after each sentence. The
first column contains the current word, the second its part-of-speech tag and
the third its lemma. Here is an example of the file format:"
Determine whether the first sentence of that passage should read "separated
by tabs" instead.
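
To make the difference concrete, here is a minimal sketch of how the two delimiters behave on a single training line. It does not use or reproduce the OpenNLP lemmatizer code; the class name LemmatizerLineSplit, its helper methods, and the sample lines are hypothetical and exist only for illustration.

```java
import java.util.Arrays;

// Hypothetical helper, not part of OpenNLP: compares tab and space splitting
// on lines shaped like "word<delimiter>postag<delimiter>lemma".
class LemmatizerLineSplit {

    // Split one training line on tab characters only.
    static String[] splitOnTabs(String line) {
        return line.split("\t");
    }

    // Split one training line on single space characters only.
    static String[] splitOnSpaces(String line) {
        return line.split(" ");
    }

    public static void main(String[] args) {
        // Illustrative sample lines (assumed, tab-delimited).
        String simple = "reckons\tVBZ\treckon";
        String multiword = "New York\tNNP\tnew york";

        // Tab splitting yields the three expected columns in both cases.
        System.out.println(Arrays.toString(splitOnTabs(simple)));
        System.out.println(Arrays.toString(splitOnTabs(multiword)));

        // Space splitting leaves the tab-delimited line as one column ...
        System.out.println(splitOnSpaces(simple).length);
        // ... and cuts a token containing a space in the wrong places.
        System.out.println(Arrays.toString(splitOnSpaces(multiword)));
    }
}
```

If the training data is tab-delimited, a reader who follows the "separated by spaces" wording and writes space-delimited lines would get a single column from a tab-based split, which is why the documentation and the code need to agree on one delimiter.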