[
https://issues.apache.org/jira/browse/SYSTEMML-452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Matthias Boehm resolved SYSTEMML-452.
-------------------------------------
Resolution: Fixed
Assignee: Matthias Boehm
Fix Version/s: SystemML 0.10
> JMLC API: Support for text analytics usecases
> ---------------------------------------------
>
> Key: SYSTEMML-452
> URL: https://issues.apache.org/jira/browse/SYSTEMML-452
> Project: SystemML
> Issue Type: Improvement
> Components: APIs
> Reporter: Laura Chiticariu
> Assignee: Matthias Boehm
> Fix For: SystemML 0.10
>
>
> I am working on text analytics use cases (e.g., document classification,
> entity extraction).
> I would like to use the JMLC interface at scoring time, but can't find the
> right method in org.apache.sysml.api.jmlc.PreparedScript.
> For entity extraction, I need features associated with every token in the
> document. In this case, the features are conceptually represented as a table
> with 3 columns:
> - tokenID (Integer): consecutive integer numbers, representing the position
> of the token in the document (entity extraction is essentially a problem of
> classifying every token as Begin_Entity, Inside_Entity, Outside_Entity, and
> hence the order of tokens in the document is important)
> - featureName (String): name of the feature, for example, whether the token
> is a capitalized word, or the surface form of the token, etc.
> - featureValue (Integer): an integer, in this case always 1 since I do not
> include features that are absent.
> For document classification, the order of tokens in the document may or may
> not be important. In the simplest case, assume the order is not important.
> For each document, we just use the surface form of each token in the document
> as feature name, and the number of times that surface form appears in that
> document as feature value. So the features are conceptually represented as a
> table with 2 columns:
> - featureName (String): the surface form of the token
> - featureValue (Integer): the number of times the surface form appears within
> the document
> Essentially, for both use cases I would like to pass to JMLC a table with a
> schema, where each column has a known basic datatype (I can think of String,
> Integer, Float, Boolean). Is this possible?
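
For illustration only, the two feature tables described in the issue could be
built as string tables in plain Java, roughly as sketched below. This is not
the JMLC API and makes no SystemML calls. The class FeatureTables, its method
names, the "surface=" and "isCapitalized" feature names, and the choice to
start tokenID at 1 are all assumptions made for this example. How such a table
would be handed to PreparedScript is left open here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of SystemML: builds the two tables from the
// issue description as String[][] rows (one String[] per table row).
class FeatureTables {

    // Entity extraction: rows of {tokenID, featureName, featureValue}.
    // Only features that are present are emitted, so featureValue is always "1".
    // Assumption: tokenID starts at 1 and follows the token order.
    static String[][] tokenFeatures(String[] tokens) {
        List<String[]> rows = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            String id = String.valueOf(i + 1);
            // Surface form of the token as a feature (illustrative naming).
            rows.add(new String[] {id, "surface=" + tokens[i], "1"});
            // Capitalization feature, emitted only when it holds.
            if (!tokens[i].isEmpty() && Character.isUpperCase(tokens[i].charAt(0))) {
                rows.add(new String[] {id, "isCapitalized", "1"});
            }
        }
        return rows.toArray(new String[0][]);
    }

    // Document classification: rows of {featureName, featureValue}, where
    // featureValue is the number of occurrences of the surface form.
    // Rows appear in order of first occurrence.
    static String[][] bagOfWords(String[] tokens) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String t : tokens) {
            counts.merge(t, 1, Integer::sum);
        }
        String[][] rows = new String[counts.size()][];
        int i = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            rows[i++] = new String[] {e.getKey(), String.valueOf(e.getValue())};
        }
        return rows;
    }

    public static void main(String[] args) {
        String[] doc = {"Laura", "filed", "the", "issue", "the", "same", "day"};
        for (String[] row : tokenFeatures(doc)) {
            System.out.println(String.join("\t", row));
        }
        for (String[] row : bagOfWords(doc)) {
            System.out.println(String.join("\t", row));
        }
    }
}
```

Keeping every cell as a String and carrying the column types (Integer, String,
Integer for the first table; String, Integer for the second) in a separate
schema is one simple way to represent a typed table. It is a design choice for
this sketch, not a statement about how JMLC represents its inputs.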