Dear Wiki user, You have subscribed to a wiki page or wiki category on "Tika Wiki" for change notification.
The "ContentMimeDetection" page has been changed by Lukeliush:
https://wiki.apache.org/tika/ContentMimeDetection?action=diff&rev1=16&rev2=17

'''Motivation'''

- The feature of TIKA-1582 is an extension of TIKA MIME detection based on file contents, i.e. the file byte histograms, and this feature follows a standard data mining process that extracts the knowledge out of the data (bytes). The motivation of this feature is to offer users with an option where content based detection approach can be used, the content can be defined in several ways, they can be the entire file bytes, byte n-grams, byte histograms, etc. In this feature, the content byte histogram is used.
+ The feature of TIKA-1582 is an extension of TIKA MIME detection based on file contents, i.e. the file byte histograms, and this feature follows a standard data mining process that extracts the knowledge out of the data (bytes). The motivation of this feature is to provide users with an option where content based detection approach can be used, the content can be defined in several ways, they can be the entire file bytes, byte n-grams, byte histograms, etc. In this feature, the content byte histogram is used.

Some files are very large, and building byte histograms for them takes a significant amount of time. It is worth noting, however, that with domain-specific knowledge or heuristics (e.g. there might be crucial regions in the file that help with detection), we can further reduce the effort required for knowledge discovery or for mining the particular patterns used in type detection.
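
To make the byte-histogram idea concrete, here is a minimal Java sketch. It is not the TIKA-1582 implementation: the class name `ByteHistogramSketch`, the `maxBytes` cap (standing in for the "crucial region" heuristic by reading only a prefix of the file), and normalization by total byte count are all assumptions chosen for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of Apache Tika.
class ByteHistogramSketch {

    /**
     * Counts how often each byte value (0..255) occurs in at most the
     * first maxBytes bytes of the stream. Capping the read is one simple
     * way to avoid scanning a very large file in full (an assumption here).
     */
    public static long[] histogram(InputStream in, long maxBytes) throws IOException {
        long[] counts = new long[256];
        byte[] buffer = new byte[8192];
        long remaining = maxBytes;
        while (remaining > 0) {
            int toRead = (int) Math.min(buffer.length, remaining);
            int n = in.read(buffer, 0, toRead);
            if (n < 0) {
                break; // end of stream
            }
            for (int i = 0; i < n; i++) {
                counts[buffer[i] & 0xFF]++; // mask to treat the byte as unsigned
            }
            remaining -= n;
        }
        return counts;
    }

    /**
     * Converts raw counts to relative frequencies that sum to 1.
     * Dividing by the total is an assumed normalization, used so that
     * files of different sizes yield comparable feature vectors.
     */
    public static double[] normalize(long[] counts) {
        long total = 0;
        for (long c : counts) {
            total += c;
        }
        double[] freq = new double[counts.length];
        if (total == 0) {
            return freq; // empty input: all zeros
        }
        for (int i = 0; i < counts.length; i++) {
            freq[i] = (double) counts[i] / total;
        }
        return freq;
    }

    public static void main(String[] args) throws IOException {
        byte[] sample = "%PDF-1.4 sample".getBytes(StandardCharsets.US_ASCII);
        // Only the first 8 bytes are histogrammed.
        double[] features = normalize(histogram(new ByteArrayInputStream(sample), 8));
        System.out.println("frequency of '%': " + features['%']);
    }
}
```

The resulting 256-element vector is the kind of input a trained classifier could consume for type detection. How the prefix length or region is chosen would depend on the domain knowledge mentioned above.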
