FWIW, we do index the full text of (PDF and?) DjVu files on Commons
(because it's stored in img_metadata). It's probably the biggest
improvement CirrusSearch brought for Commons.
And we also index office documents via Tika (*.doc and similar).
And I think it should not be a feature of the searc
FWIW, we do index the full text of (PDF and?) DjVu files on Commons
(because it's stored in img_metadata). It's probably the biggest
improvement CirrusSearch brought for Commons.
Nemo
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https:/
I wrote about my problem ~2 years ago:
http://wikitech-l.wikimedia.narkive.com/6G0YPmWQ/need-a-way-to-modify-text-before-indexing-was-searchupdate
It seems I missed the latest message, so I want to reply to it now:
With lsearchd and Elasticsearch, we absolutely wouldn't want to munge
fi
On Wed, Jan 15, 2014 at 12:07 AM, Vitaliy Filippov wrote:
>> SearchEngine subclasses can implement getTextFromContent() if they want to
>> override the normal text fetching behavior.
>>
>
> I can't put it into SearchEngine subclass because Tika isn't a search
> engine, it's rather a Java applicatio
SearchEngine subclasses can implement getTextFromContent() if they want
to override the normal text fetching behavior.
I can't put it into SearchEngine subclass because Tika isn't a search
engine, it's rather a Java application that runs separately and extracts
text from binary files like *
On Tue, Jan 14, 2014 at 2:33 PM, wrote:
> Hi!
>
> Change https://gerrit.wikimedia.org/r/#/c/79025/ that was merged to 1.22
> breaks my TikaMW extension - I used that hook to extract contents from
> binary files so the user can then search on it.
>
> Maybe you can add some other hook for this purp
Hi!
Change https://gerrit.wikimedia.org/r/#/c/79025/ that was merged to 1.22
breaks my TikaMW extension - I used that hook to extract contents from
binary files so the user can then search on it.
Maybe you can add some other hook for this purpose?
See also https://github.com/mediawiki4intran
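
For readers who want to see the shape of the idea discussed in this thread (extract text from a binary file with a separately running Tika process, then hand that text to the indexer), here is a minimal sketch in Python. It is not the TikaMW extension's code and not a MediaWiki hook. The function names `extract_with_tika` and `text_for_index` are hypothetical, and the server address is an assumption (a Tika server listening locally on port 9998 and accepting a PUT to `/tika`).

```python
import urllib.request
from typing import Callable, Optional

# Assumption: a Tika server runs as a separate process at this address.
TIKA_URL = "http://localhost:9998/tika"


def extract_with_tika(data: bytes, url: str = TIKA_URL) -> str:
    """Send raw file bytes to the Tika server and return the plain text it extracts."""
    request = urllib.request.Request(
        url, data=data, method="PUT", headers={"Accept": "text/plain"}
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def text_for_index(
    page_text: str,
    file_bytes: Optional[bytes],
    extractor: Callable[[bytes], str] = extract_with_tika,
) -> str:
    """Hypothetical pre-index step: append text extracted from an uploaded file."""
    if not file_bytes:
        # No binary file attached: index the page text unchanged.
        return page_text
    try:
        extracted = extractor(file_bytes).strip()
    except OSError:
        # Extraction service unreachable: fall back to the page text alone.
        return page_text
    return f"{page_text}\n{extracted}" if extracted else page_text
```

Keeping the extractor a plain callable means the indexing step does not care whether the text comes from Tika or from something else, which is roughly the separation argued for above: the extraction tool is not a search engine, so it sits in front of whichever one is in use.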