yifeim commented on issue #15428: Dataloader does not support sparse data
URL: 
https://github.com/apache/incubator-mxnet/issues/15428#issuecomment-508835732
 
 
   The vanilla sparse format lacks sufficient information for, e.g., 
recommendation applications. Many tools extend it with group-wise ranking 
labels, field identifiers, and pipe-delimited namespaces. Here are some examples:
   
   1. Group-wise ranking loss
   
   vw allows auxiliary labels and [shared information among 
groups](https://github.com/VowpalWabbit/vowpal_wabbit/wiki/Contextual-Bandit-algorithms)
   ```
   shared | s_1 s_2
   0:1.0:0.5 | a:1 b:1 c:1
   | a:0.5 b:2 c:1
   ```
   
   xgboost allows a [`.group` 
file](https://xgboost.readthedocs.io/en/latest/tutorials/input_format.html#group-input-format)
 to count how many rows belong to one ranking group
   ```
   2
   3
   ```
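
To make the group counts concrete, here is a minimal sketch of how a loader might turn them into row ranges. The helper name `group_slices` is hypothetical and not part of xgboost or MXNet; it only assumes that each line of the `.group` file holds the number of consecutive rows in one ranking group.

```python
from itertools import accumulate


def group_slices(group_sizes):
    """Turn per-group row counts into half-open (start, stop) row ranges."""
    stops = list(accumulate(group_sizes))
    starts = [0] + stops[:-1]
    return list(zip(starts, stops))


# The two-line .group content shown above: 2 rows, then 3 rows.
sizes = [int(line) for line in "2\n3\n".splitlines() if line.strip()]
print(group_slices(sizes))  # [(0, 2), (2, 5)]
```

A data loader that batches by ranking group would need these ranges in addition to the sparse feature matrix itself.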
   
   2. Multi-field features
   
   libffm uses [multiple 
columns](https://github.com/ycjuan/libffm/blob/master/README#L116)
   ```
   <label> <field1>:<feature1>:<value1> <field2>:<feature2>:<value2> ...
   ```
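
As an illustration, a line in this layout could be parsed roughly as follows. This is a sketch, not libffm's own reader: `parse_libffm_line` is a hypothetical helper, and it assumes integer field and feature indices and a numeric label.

```python
def parse_libffm_line(line):
    """Parse '<label> <field>:<feature>:<value> ...' into (label, triples)."""
    label, *tokens = line.split()
    features = []
    for tok in tokens:
        # Assumption: exactly three colon-separated parts per token.
        field, feature, value = tok.split(":")
        features.append((int(field), int(feature), float(value)))
    return float(label), features


print(parse_libffm_line("1 0:3:1 1:7:0.5"))
# (1.0, [(0, 3, 1.0), (1, 7, 0.5)])
```

Compared with plain libsvm tokens (`index:value`), the extra field column is exactly the kind of information a two-column sparse reader would drop.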
   
   vw uses [multiple 
pipes](https://github.com/VowpalWabbit/vowpal_wabbit/wiki/input-format)
   ```
   1 1.0 |MetricFeatures:3.28 height:1.5 length:2.0 |Says black with white 
stripes |OtherFeatures NumberOfLegs:4.0 HasStripes
   1 1.0 zebra|MetricFeatures:3.28 height:1.5 length:2.0 |Says black with white 
stripes |OtherFeatures NumberOfLegs:4.0 HasStripes
   ```
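
The pipe-delimited layout above can be split into namespaces with a few lines of code. The sketch below is a simplified reading of the vw input format, not vw's actual parser: `parse_vw_line` is a hypothetical helper, it treats the token glued to a pipe as `namespace[:scale]`, and it assumes a feature without an explicit value has value 1.0.

```python
def parse_vw_line(line):
    """Split a vw-style line into header tokens and {namespace: (scale, features)}."""
    header, *sections = line.split("|")
    namespaces = {}
    for section in sections:
        tokens = section.split()
        # A namespace name must follow the pipe with no space in between;
        # otherwise the features go to the unnamed default namespace.
        if section[:1].isspace() or not tokens:
            name, scale = "", 1.0
        else:
            name, _, scale_text = tokens.pop(0).partition(":")
            scale = float(scale_text) if scale_text else 1.0
        features = {}
        for tok in tokens:
            feat, _, value = tok.partition(":")
            features[feat] = float(value) if value else 1.0
        namespaces[name] = (scale, features)
    return header.split(), namespaces


line = ("1 1.0 zebra|MetricFeatures:3.28 height:1.5 length:2.0 "
        "|Says black with white stripes |OtherFeatures NumberOfLegs:4.0 HasStripes")
header, namespaces = parse_vw_line(line)
print(header)                       # ['1', '1.0', 'zebra']
print(namespaces["OtherFeatures"])  # (1.0, {'NumberOfLegs': 4.0, 'HasStripes': 1.0})
```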
   
   3. Other delimiters in open-source datasets, e.g., the [Criteo counterfactual 
analysis challenge](https://arxiv.org/abs/1612.00367) format is similar to the vw 
format but uses spaces as delimiters.
   ```
   example ${exID}: ${hashID} ${wasAdClicked} ${propensity} ${nbSlots} 
${nbCandidates} ${displayFeat1}:${v 1} ...
   ${wasProduct1Clicked} exid:${exID} ${productFeat1 1}:${v1 1} ...
   ```
   
   It is rather difficult to enumerate all the cases, so I would recommend 
allowing more flexibility, e.g., with a regex format for the parser.
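
One possible shape for such a regex-driven parser is sketched below. Nothing here is an existing MXNet API: `make_parser` is a hypothetical factory, and the sketch assumes the first whitespace-separated token is the label and that the caller supplies a token pattern with named groups.

```python
import re


def make_parser(token_pattern):
    """Build a line parser from a user-supplied regex with named groups."""
    token_re = re.compile(token_pattern)

    def parse(line):
        # Assumption: the label is the first token; the rest are feature tokens.
        label, _, rest = line.strip().partition(" ")
        return label, [m.groupdict() for m in token_re.finditer(rest)]

    return parse


# The same code handles two layouts by swapping only the pattern.
libsvm = make_parser(r"(?P<index>\d+):(?P<value>\S+)")
libffm = make_parser(r"(?P<field>\d+):(?P<index>\d+):(?P<value>\S+)")

print(libsvm("1 3:1 7:0.5"))
# ('1', [{'index': '3', 'value': '1'}, {'index': '7', 'value': '0.5'}])
print(libffm("0 0:3:1 1:7:0.5"))
# ('0', [{'field': '0', 'index': '3', 'value': '1'}, {'field': '1', 'index': '7', 'value': '0.5'}])
```

Named groups let downstream code decide which captures become indices, values, or auxiliary fields, without the loader having to know each format in advance.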

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
