[ 
https://issues.apache.org/jira/browse/SYSTEMML-452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Boehm resolved SYSTEMML-452.
-------------------------------------
       Resolution: Fixed
         Assignee: Matthias Boehm
    Fix Version/s: SystemML 0.10

> JMLC API: Support for text analytics usecases
> ---------------------------------------------
>
>                 Key: SYSTEMML-452
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-452
>             Project: SystemML
>          Issue Type: Improvement
>          Components: APIs
>            Reporter: Laura Chiticariu
>            Assignee: Matthias Boehm
>             Fix For: SystemML 0.10
>
>
> I am working on text analytics use cases (e.g., document classification, 
> entity extraction).
> I would like to use the JMLC interface at scoring time, but can't find the 
> right method in org.apache.sysml.api.jmlc.PreparedScript.
> For entity extraction, I need features associated with every token in the 
> document. In this case, the features are conceptually represented as a table 
> with 3 columns: 
> - tokenID (Integer) - consecutive integer numbers, representing the position 
> of the token in the document (entity extraction is essentially a problem of 
> classifying every token as Begin_Entity, Inside_Entity, Outside_Entity, and 
> hence the order of tokens in the document is important)
> - featureName (String): name of the feature, for example, whether the token 
> is a capitalized word, or the surface form of the token, etc.
> - featureValue (Integer): an integer, in this case always 1 since I do not 
> include features that are absent.
> For document classification, the order of tokens in the document may or may 
> not be important. In the simplest case, assume the order is not important. 
> For each document, we just use the surface form of each token in the document 
> as feature name, and the number of times that surface form appears in that 
> document as feature value. So the features are conceptually represented as a 
> table with 2 columns: 
> - featureName (String): the surface form of the token
> - featureValue (Integer): the number of times the surface form appears within 
> the document
> Essentially, for both use cases I would like to pass to JMLC a table with a 
> schema, where each column has a known basic datatype (I can think of String, 
> Integer, Float, Boolean). Is this possible?
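
For illustration only, the two tables described above could be built on the caller side roughly as follows. This is a minimal sketch in plain Java that calls no SystemML API: the class name FeatureTables, the feature names "surface=..." and "isCapitalized", and the choice to encode every cell as a String are assumptions, not part of the issue or of JMLC. How such a table would then be handed to PreparedScript is exactly the open question and is not shown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of SystemML: builds the two conceptual
// feature tables as String[][] (one row per feature, all cells as strings).
public class FeatureTables {

    // Entity extraction: rows of (tokenID, featureName, featureValue).
    // tokenID is the 1-based position of the token; featureValue is always 1
    // because absent features are simply not emitted.
    static String[][] tokenFeatures(List<String> tokens) {
        List<String[]> rows = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            String tok = tokens.get(i);
            String id = Integer.toString(i + 1);
            rows.add(new String[] {id, "surface=" + tok, "1"});
            if (!tok.isEmpty() && Character.isUpperCase(tok.charAt(0))) {
                rows.add(new String[] {id, "isCapitalized", "1"});
            }
        }
        return rows.toArray(new String[0][]);
    }

    // Document classification: rows of (featureName, featureValue), where
    // featureValue is the number of occurrences of the surface form.
    static String[][] termCounts(List<String> tokens) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String tok : tokens) {
            counts.merge(tok, 1, Integer::sum);
        }
        List<String[]> rows = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            rows.add(new String[] {e.getKey(), Integer.toString(e.getValue())});
        }
        return rows.toArray(new String[0][]);
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList("Laura", "works", "at", "IBM", "at", "Almaden");
        for (String[] row : tokenFeatures(doc)) {
            System.out.println(String.join("\t", row));
        }
        for (String[] row : termCounts(doc)) {
            System.out.println(String.join("\t", row));
        }
    }
}
```

Both tables mix string and integer columns, which is why a purely numeric matrix input would not be enough for these use cases.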



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
