Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Tika Wiki" for change 
notification.

The "BaysianMimeTypeSelector" page has been changed by Lukeliush:
https://wiki.apache.org/tika/BaysianMimeTypeSelector?action=diff&rev1=5&rev2=6

  
  [[https://issues.apache.org/jira/browse/TIKA-1517|TIKA-1517 [MIME type 
selection with probability]]]
  
- The motivation is that the current implementation within MimeTypes for
detecting MIME types in Tika is somewhat rigid. At the time of writing,
MimeTypes has three detection approaches, combined through a fall-back
strategy, and the result depends heavily on magic byte detection. T
+ The motivation is that the current implementation within MimeTypes for
detecting MIME types in Tika is somewhat rigid. At the time of writing,
MimeTypes has three detection approaches, combined through a fall-back
strategy, and the result depends heavily on magic byte detection.
  
- he last two approaches (i.e. extension and metadata hint matching) are
subsidiary in the final detection decision. In other words, their results
will probably be considered when there is a tie to break in the magic
bytes detection: the magic bytes method may yield multiple candidate MIME
types, and in that situation the file extension and metadata hint are used.
+ The last two approaches (i.e. extension and metadata hint matching) are
subsidiary in the final detection decision. In other words, their results
will probably be considered when there is a tie to break in the magic
bytes detection: the magic bytes method may yield multiple candidate MIME
types, and in that situation the file extension and metadata hint are used.
  
  It is also possible that in some situations the type given by file
extension or metadata hint matching is more specialized than the one given
by the magic bytes method; the most specialized type is then returned. This
implementation is inflexible in situations where users prefer a particular
type of detection, e.g. they might only trust or prefer their file
extensions.
  
- In the future, more probabilistic MIME detection algorithms might be
considered for deployment in Tika; from this perspective, the current
implementation also leaves little room for extending Tika with more
detection methods.
+ In the future, more probabilistic MIME detection algorithms might be
considered for deployment in Tika; from this perspective, the current
implementation also leaves little room for expanding Tika with more
detection methods.
  
  Therefore, it would be great to have a feature like TIKA-1517, where users
can add weights or preferences to the detection methods they want to use for
detecting MIME types.
  
