Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Tika Wiki" for change 
notification.

The "PooledTimeSeriesParser" page has been changed by ChrisMattmann:
https://wiki.apache.org/tika/PooledTimeSeriesParser?action=diff&rev1=1&rev2=2

+ The [[http://arxiv.org/abs/1412.6505|Pooled Time Series algorithm]] was 
developed by [[http://michaelryoo.com/jpl-interaction.html|Michael Ryoo]]. It 
considers video descriptors over time, which allows videos to be compared 
based on the activity going on in their scenes. In short, Pooled Time Series 
is a video comparison metric. An ALv2-licensed version of the 
[[https://github.com/chrismattmann/pooled_time_series|Pooled Time Series code]] 
is available for computing Histogram of Oriented Gradients (HOG) and 
Histogram of Optical Flows (HOF), which can be useful extracted data and 
metadata for a Tika Parser.
- [[http://michaelryoo.com/jpl-interaction.html]]
- [[https://github.com/chrismattmann/pooled_time_series]]
- [[http://arxiv.org/pdf/1412.6505v2.pdf]]
  
- ==== Metadata Representation ====
+ = Metadata Representation =
  
+ The ultimate goal of the project is to extract metadata and data from 
videos and to index that information inside a search engine like Apache Solr. 
Videos, like images, are just numbers: an ordered sequence of numbers, or 
matrices. There are many ways in which these numbers can be defined. Some 
popular visual descriptors are Histogram of Oriented Gradients, Optical Flow 
vectors, and RGB or color histograms. The challenge is to map this datatype to 
a datatype that can be understood by Solr. In the case of color-based 
histograms, we can convert the image into a matrix of hex values, where each 
hex value is the pixel color value
- The ultimate goal of the project is to be able to extract metadata from 
videos and index it inside Solr.
- 
- Videos, like images, are just numbers - or a ordered sequence of number - or 
matrices.
- 
- There are many ways in which these numbers can be defined.
- Some popular visual descriptors are Histogram of Gradients, Optical Flow 
vectors, RGB or Color Histograms.
- The challenge is to figure out a way to map this datatype to a datatype that 
can be understood by Solr.
- 
- In the case of color based histograms, we can convert the image into a matrix 
of hex values, where each hex value is the pixel color value
  and index that as a text_ws field in Solr.
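
As a rough illustration of the color-based idea above, the following Python sketch turns a small RGB pixel matrix into whitespace-separated hex tokens, the kind of string a whitespace-tokenized field such as text_ws could index. The function name `pixels_to_hex_tokens` and the sample pixels are hypothetical and are not part of Tika or Solr.

```python
def pixels_to_hex_tokens(pixels):
    """Flatten rows of (r, g, b) tuples into one string of hex color tokens."""
    tokens = []
    for row in pixels:
        for r, g, b in row:
            # Two hex digits per channel, e.g. (255, 0, 0) -> "ff0000"
            tokens.append(f"{r:02x}{g:02x}{b:02x}")
    # Whitespace-separated so a whitespace tokenizer sees one token per pixel
    return " ".join(tokens)


# Hypothetical 2x2 image: red, green, blue, white
sample = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (255, 255, 255)],
]
field_value = pixels_to_hex_tokens(sample)
print(field_value)  # ff0000 00ff00 0000ff ffffff
```

The resulting string would be sent to Solr as the value of the text field; how the field is named and configured in the schema is left open here.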
  
+ = Some Related Efforts =
- This is what ShutterStock did with respect to an image search tool they've 
built
- 
https://lucidworks.com/blog/shutterstock-searches-35-million-images-color-using-apache-solr/
  
- Another idea I was thinking of was to index the data as a XHTML document of 
table values,
+ ShutterStock developed an 
[[https://lucidworks.com/blog/shutterstock-searches-35-million-images-color-using-apache-solr/|image
 search tool]] using a similar approach.
  
- where each <tr>..</tr> would be a row of the feature matrix and <td> would be 
the corresponding element in that column.
+ = Representation of output data =
  
- However, while performing ranking or querying we would have to compute a 
distance function on these values (for the dataset and the query video)
+ The data output from the Pooled Time Series parser is an XHTML document of 
table values, where each <tr>..</tr> is a row of the feature matrix and each 
<td> is the corresponding element in that column. When using a search engine 
like Apache Solr for ranking or querying, we can compute a distance function 
on these values (for the dataset and the query video), such as Chi-Squared, 
which is what the Pooled Time Series algorithm does.
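
To make the table layout and the distance step concrete, here is a minimal Python sketch that reads a feature matrix out of <tr>/<td> rows and compares two matrices with one common form of the chi-squared distance. The sample tables, the function names, and the exact formula (half the sum of squared differences over sums) are assumptions for illustration; the parser's real output may carry an XHTML namespace and other markup that this sketch does not handle.

```python
import xml.etree.ElementTree as ET


def table_to_matrix(xhtml):
    """Parse a namespace-free <table> string into a list of float rows."""
    root = ET.fromstring(xhtml)
    return [[float(td.text) for td in tr.iter("td")] for tr in root.iter("tr")]


def chi_squared(a, b):
    """Chi-squared distance between two equally shaped matrices.

    Uses 0.5 * sum((x - y)^2 / (x + y)), skipping cells where x + y == 0.
    """
    total = 0.0
    for row_a, row_b in zip(a, b):
        for x, y in zip(row_a, row_b):
            if x + y != 0:
                total += (x - y) ** 2 / (x + y)
    return 0.5 * total


# Hypothetical feature tables for a dataset video and a query video
doc_a = "<table><tr><td>1</td><td>2</td></tr><tr><td>3</td><td>4</td></tr></table>"
doc_b = "<table><tr><td>1</td><td>0</td></tr><tr><td>1</td><td>4</td></tr></table>"

m_a = table_to_matrix(doc_a)
m_b = table_to_matrix(doc_b)
print(chi_squared(m_a, m_b))  # 1.5
```

A smaller distance means the two feature matrices, and so the activity in the two videos, are more alike; identical matrices give 0.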
  
+ A Tika Parser has been developed that implements the Pooled Time Series 
algorithm above and that outputs the HOF and HOG data from videos for use in 
later processing and indexing. Read on below to install and use it!
- How have other users solved this problem? There must be instances of matrix 
type data showing up in other domains, 
- such as geography, physics and other scientific domains. How is the metadata 
designed in such cases?
  
+ = Pre-requisites =
+ 
+ == Install Pooled Time Series ==
+ 
