[
https://issues.apache.org/jira/browse/SOLR-11838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16343602#comment-16343602
]
Kevin Watters commented on SOLR-11838:
--------------------------------------
I'm very excited to see this integration happening. [~gus_heck] has been
working with me on some DL4j projects, in particular training models and
evaluating them for classification. I think at a high level there are 3 main
integration patterns that we could/should consider in Solr.
# using a model at ingest time to tag/annotate a record going into the
index (a primary example would be something like sentiment-analysis tagging).
This implies the model was trained and saved somewhere.
# using a Solr index (query) to generate a set of training/test data so that
DL4j can "fit" the model and train it. (there might even be a desire for some
join functionality so you can join together 2 datasets to create ad hoc
training datasets.)
# (this is a bit more out there.) indexing each node of the multi-layer
network/computation graph as a document in the index, and using a query to
evaluate the output of the model by traversing the documents in the index to
ultimately come up with a set of relevancy scores for the documents that
represent the output layer of the network.
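To make pattern #1 concrete, here is a minimal sketch of ingest-time tagging. Everything here is hypothetical: the document is just a Map, and the "model" is a stub Function standing in for a trained DL4j network loaded from a saved model file.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: annotate a record with a model's prediction before
// it goes into the index. The model is a stub Function<String,String>;
// in practice it would be a trained, serialized network.
public class IngestTagger {
    private final Function<String, String> model;

    public IngestTagger(Function<String, String> model) {
        this.model = model;
    }

    // Copies the incoming document and adds a "sentiment" field derived
    // from the "text" field.
    public Map<String, Object> tag(Map<String, Object> doc) {
        Map<String, Object> out = new HashMap<>(doc);
        out.put("sentiment", model.apply((String) doc.get("text")));
        return out;
    }

    public static void main(String[] args) {
        // Stub model: trivially keyword-based, standing in for a real network.
        IngestTagger tagger = new IngestTagger(
                text -> text.contains("great") ? "positive" : "negative");
        Map<String, Object> doc = new HashMap<>();
        doc.put("id", "1");
        doc.put("text", "this product is great");
        System.out.println(tagger.tag(doc).get("sentiment")); // prints "positive"
    }
}
```

In Solr terms this logic would most naturally live in an update request processor, with the model loaded once at startup rather than per document.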
I think, perhaps, the most interesting use case is #2. So basically, the idea
is that you define a network (specify the layers, types of layers,
activation function, etc.) and then specify a query that matches the set of
documents in the index that would be used to train that model. Currently DL4j
uses "datavec" to handle all the data normalization prior to going into the
model for training. That exposes a DataSetIterator. The DataSetIterator could
be replaced with an iterator that sits on top of the export handler or even
just a raw search result. The general use cases here for pagination would be
# iterating the full result set (presumably multiple times, as the model will
make multiple passes over the data when training.)
# generating a random ordering of the dataset being returned
# excluding a random (but deterministic?) set of documents from the main query
to provide a holdout testing dataset.
Keep in mind that in network training you typically have both a training
dataset and a testing dataset.
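The three pagination use cases above can be sketched with plain Java, leaving the Solr parts as stand-ins (documents are just id strings here; all names are hypothetical). The holdout split is made deterministic by hashing the document id, so the same documents are excluded on every pass, and the shuffle is seeded so each epoch can reproduce the same "random" ordering.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.stream.Collectors;

// Sketch of the pagination use cases, with stand-ins for the Solr side.
public class TrainingSetPaging {

    // Deterministic holdout: a doc is held out for testing when the hash
    // of its id falls into the bottom `holdoutPercent` buckets of 100.
    static boolean isHoldout(String docId, int holdoutPercent) {
        int bucket = Math.floorMod(docId.hashCode(), 100);
        return bucket < holdoutPercent;
    }

    // Training split: everything the holdout filter does not exclude.
    static List<String> trainingSplit(List<String> ids, int holdoutPercent) {
        return ids.stream()
                  .filter(id -> !isHoldout(id, holdoutPercent))
                  .collect(Collectors.toList());
    }

    // Seeded shuffle: the same seed yields the same ordering every epoch.
    static List<String> shuffled(List<String> ids, long seed) {
        List<String> copy = new ArrayList<>(ids);
        Collections.shuffle(copy, new Random(seed));
        return copy;
    }
}
```

A Solr-backed DataSetIterator could apply the same idea while streaming from the export handler, turning each page of results into feature vectors as it goes.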
The final outcome of this would be a ComputationGraph/MultiLayerNetwork, which
can be serialized by DL4j as a JSON file, and the other output could/should be
the evaluation or accuracy scores of the model (F1, accuracy, and confusion
matrix).
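The evaluation scores mentioned can all be derived from the confusion matrix itself. As a minimal sketch (binary classification only, plain Java rather than DL4j's own evaluation classes), accuracy and F1 fall out of the four matrix cells:

```java
// Minimal binary-classification evaluation from a 2x2 confusion matrix.
// tp/fp/fn/tn are the true-positive, false-positive, false-negative,
// and true-negative counts.
public class BinaryEval {

    // Accuracy: correct predictions over all predictions.
    static double accuracy(int tp, int fp, int fn, int tn) {
        return (double) (tp + tn) / (tp + fp + fn + tn);
    }

    // F1: harmonic mean of precision and recall.
    static double f1(int tp, int fp, int fn) {
        double precision = (double) tp / (tp + fp);
        double recall = (double) tp / (tp + fn);
        return 2 * precision * recall / (precision + recall);
    }
}
```

DL4j's own evaluation support reports these same statistics (plus the full confusion matrix) for multi-class models; the point of the sketch is just that the reported scores and the matrix are two views of the same counts.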
As per the comments about natives, yes, there are definitely platform-dependent
parts of DL4j, in particular "nd4j", which can be GPU- or CPU-backed, but there
are also other dependencies on javacv/javacpp. The javacv/javacpp stuff is
really only used for image manipulation, as it's the Java binding to OpenCV.
The dependency tree for DL4j is rather large, so I think we'll need to take
care that we're not injecting a bunch of conflicting jar files; perhaps that
can be managed if we identify the conflicting jar versions up front.
> explore supporting Deeplearning4j NeuralNetwork models
> ------------------------------------------------------
>
> Key: SOLR-11838
> URL: https://issues.apache.org/jira/browse/SOLR-11838
> Project: Solr
> Issue Type: New Feature
> Reporter: Christine Poerschke
> Priority: Major
> Attachments: SOLR-11838.patch
>
>
> [~yuyano] wrote in SOLR-11597:
> bq. ... If we think to apply this to more complex neural networks in the
> future, we will need to support layers ...
> [~malcorn_redhat] wrote in SOLR-11597:
> bq. ... In my opinion, if this is a route Solr eventually wants to go, I
> think a better strategy would be to just add a dependency on
> [Deeplearning4j|https://deeplearning4j.org/] ...
> Creating this ticket for the idea to be explored further (if anyone is
> interested in exploring it), complementary to and independent of the
> SOLR-11597 RankNet related effort.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)