Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Tika Wiki" for change notification.

The "AdityaDhulipala" page has been changed by AdityaDhulipala:
https://wiki.apache.org/tika/AdityaDhulipala?action=diff&rev1=1&rev2=2

  [[https://github.com/chrismattmann/pooled_time_series]]

  [[http://arxiv.org/pdf/1412.6505v2.pdf]]

+ ==== Metadata Representation ====
+
+ The ultimate goal of the project is to extract metadata from videos and index it in Solr.
+
+ Videos, like images, are just numbers: ordered sequences of numbers, or matrices.
+
+ There are many ways in which these numbers can be defined.
+ Some popular visual descriptors are Histogram of Gradients, Optical Flow vectors, and RGB or color histograms.
+ The challenge is to map this datatype to a datatype that Solr can understand.
+
+ In the case of color-based histograms, we can convert the image into a matrix of hex values, where each hex value is a pixel's color value,
+ and index that as a text_ws field in Solr.
+
+ This is what ShutterStock did for an image search tool they built:
+ https://lucidworks.com/blog/shutterstock-searches-35-million-images-color-using-apache-solr/
+
+ Another idea I was considering is to index the data as an XHTML document of table values,
+
+ where each <tr>..</tr> is a row of the feature matrix and each <td> is the element in the corresponding column.
+
+ However, to rank or query we would have to compute a distance function on these values, between each video in the dataset and the query video.
+
+ How have other users solved this problem? There must be instances of matrix-type data showing up in other domains,
+ such as geography, physics and other scientific fields. How is the metadata designed in such cases?
+
+ ----
+ CategoryHomepage
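
To make the hex-value idea and the distance question above concrete, here is a minimal Python sketch. It assumes a frame is a matrix of (r, g, b) tuples and a feature matrix is a list of equal-length rows of floats. The function names `pixels_to_hex_tokens` and `euclidean_distance` are hypothetical. The page does not name a distance function, so Euclidean distance is only an illustrative choice. The sketch does not call Solr; it only produces the whitespace-separated string that a text_ws field could tokenize.

```python
from math import sqrt


def pixels_to_hex_tokens(pixels):
    """Serialize a matrix of (r, g, b) tuples, row by row, as
    whitespace-separated six-digit hex tokens."""
    return " ".join(
        "{:02x}{:02x}{:02x}".format(r, g, b)
        for row in pixels
        for (r, g, b) in row
    )


def euclidean_distance(a, b):
    """Euclidean distance between two equally shaped feature matrices
    (an assumed metric; the page leaves the choice open)."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("feature matrices must have the same shape")
    return sqrt(
        sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    )


# A 2x2 "frame": red, green / blue, white.
frame = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (255, 255, 255)],
]
tokens = pixels_to_hex_tokens(frame)  # "ff0000 00ff00 0000ff ffffff"

# Two toy feature matrices that differ in a single element.
indexed = [[1.0, 2.0], [3.0, 4.0]]
query = [[1.0, 2.0], [3.0, 6.0]]
dist = euclidean_distance(indexed, query)  # 2.0
```

Flattening row by row discards the matrix shape, so the row and column counts would need to be stored in separate fields if the matrix has to be rebuilt at query time.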
