"So I was expecting that if we express the stream.type, we check this type before sending a Request to Solr."
Actually, the desired MIME types selected by the output connection are queried by the repository connection, so that document filtering can take place before the document is even fetched. See IOutputConnector.checkMimeTypeIndexable.

Karl

On Mon, Dec 16, 2013 at 12:11 PM, Alessandro Benedetti <[email protected]> wrote:

> Hi guys,
> I was investigating the use of the stream.type parameter that we can
> pass to a Solr Connector as an argument.
>
> From the wiki: "Tika will automatically attempt to determine the input
> document type (word, pdf, etc.) and extract the content appropriately. If
> you want, you can explicitly specify a MIME type for Tika with the
> stream.type parameter".
>
> So I was expecting that if we express the stream.type, we check this type
> before sending a Request to Solr.
> That way we would avoid sending requests for types that are not the wanted
> one.
>
> But in org.apache.manifoldcf.agents.output.solr.HttpPoster, when we add
> the content to the ContentStreamUpdateRequest, we don't check the type at
> all:
>
> contentStreamUpdateRequest.addContentStream(new RepositoryDocumentStream(is,length,contentType,contentName));
>
> So, if we pass the parameter stream.type=text/plain and we have one
> content item that is video/mp4, we expect not to send it (it may be 1 GB
> long and can cause problems).
>
> What do you think? Should we put a check on the type before sending the
> content?
> Am I missing something?
>
> --
> Benedetti Alessandro
> Visiting card: http://about.me/alessandro_benedetti
>
> "Tyger, tyger burning bright
> In the forests of the night,
> What immortal hand or eye
> Could frame thy fearful symmetry?"
>
> William Blake - Songs of Experience - 1794 England
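
For illustration only, the filter-before-fetch idea described in the reply can be sketched in plain Java as below. This is not ManifoldCF code: the class MimeTypeFilterSketch and its methods normalize, isIndexable, and selectForFetch are hypothetical names invented for this example, and the real IOutputConnector.checkMimeTypeIndexable signature is deliberately not reproduced here. The sketch assumes the repository side knows each document's declared MIME type before fetching its content.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the output side's MIME type check; not ManifoldCF code.
class MimeTypeFilterSketch {
    private final Set<String> acceptedMimeTypes;

    MimeTypeFilterSketch(Set<String> acceptedMimeTypes) {
        this.acceptedMimeTypes = Set.copyOf(acceptedMimeTypes);
    }

    // Drops parameters such as "; charset=UTF-8" and lowercases the type.
    static String normalize(String mimeType) {
        if (mimeType == null) {
            return "";
        }
        int semicolon = mimeType.indexOf(';');
        String base = semicolon >= 0 ? mimeType.substring(0, semicolon) : mimeType;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    // Plays the role of the question the repository side asks the output side.
    boolean isIndexable(String mimeType) {
        return acceptedMimeTypes.contains(normalize(mimeType));
    }

    // Returns the names of documents worth fetching; the rest are skipped
    // before any content is transferred.
    List<String> selectForFetch(Map<String, String> nameToMimeType) {
        List<String> selected = new ArrayList<>();
        for (Map.Entry<String, String> entry : nameToMimeType.entrySet()) {
            if (isIndexable(entry.getValue())) {
                selected.add(entry.getKey());
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        MimeTypeFilterSketch filter = new MimeTypeFilterSketch(Set.of("text/plain"));
        Map<String, String> candidates = new LinkedHashMap<>();
        candidates.put("notes.txt", "text/plain; charset=UTF-8");
        candidates.put("movie.mp4", "video/mp4");
        System.out.println(filter.selectForFetch(candidates)); // prints [notes.txt]
    }
}
```

In this sketch the video/mp4 document is rejected from its declared type alone, so a large file like the 1 GB example in the quoted message would never be read or posted.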
