Dear Wiki user, You have subscribed to a wiki page or wiki category on "Tika Wiki" for change notification.
The "ContentMimeDetection" page has been changed by Lukeliush:
https://wiki.apache.org/tika/ContentMimeDetection?action=diff&rev1=2&rev2=3

- The feature of TIKA-1582 is an extension of TIKA MIME detection based on file contents, i.e. the file byte histograms, and this feature follows a standard data mining process that extract the knowledege out of the data (bytes). The motivation of this feature is to offer users with an option where content based detection approach can be used, the content can be defined in several ways, theycan be the entire file bytes, byte n-grams, byte histograms, etc. In this feature, the content-byte histogram is used, and the Content based detection analyses the byte histogram patterns (i.e. content). Some files are very huge in size, building byte histograms for those files requires significant amount time, but it is worth noting that with domain specific knowledge or the heurstics (e.g. there might be some crutial and critical regions in the file that could help with the detection.), we can further reduce the amount of effort required for knowledge discovery or mining particular patterns that we can use in the type detection.
+ The feature of TIKA-1582 is an extension of Tika MIME detection based on file contents, i.e. the file byte histograms, and it follows a standard data mining process that extracts knowledge from the data (bytes). The motivation is to offer users a content-based detection option. Content can be defined in several ways: the entire file bytes, byte n-grams, byte histograms, etc. This feature uses the content byte histogram.
+
+ Some files are very large, and building byte histograms for them takes a significant amount of time. However, with domain-specific knowledge or heuristics (e.g. there might be some crucial regions in the file that help with detection), we can further reduce the effort required for knowledge discovery or for mining the particular patterns used in type detection. Please also note that content-based MIME detection does require users to have some knowledge of data mining and machine learning. The choice of learning algorithm for the pattern mining does not seem to matter: the knowledge to be mined is a classification, and many classification learning algorithms have been invented or reinvented. Which one is best depends on the goal and the data. Each learning algorithm requires a lot of performance testing, and some data may be linearly separable while other data is not. A goal, or a set of goals, is very important, as it often is in performance tuning; we can think of this as a performance tuning problem in which we set goals for scalability, complexity, or accuracy. We therefore want to leave the choice of algorithm to users, based on their goals and the data in their environment. As an example, we have implemented two algorithms for mining patterns with the GRB file type: linear logistic regression and a neural network.
Again, the neural network with back-propagation is somewhat more complex to train, while logistic regression is far cheaper in terms of complexity, and it turns out that logistic regression also gives good results with high accuracy. It is worth noting that it is always better to circumscribe the MIME types to be detected. In the example model we built, we attempt to separate grb files from non-grb files, and one of the challenges is identifying the non-grb file types, whose class can be enormously large. The best practice is to circumscribe the set of types to be classified; again, domain-specific knowledge comes into play in defining that set for the user's specific environment.
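
To make the approach described above more concrete, here is a minimal Python sketch of the two steps: building a normalised byte histogram and training a logistic regression classifier on it. This is not the TIKA-1582 implementation and does not use any Tika API. The function names, the `max_bytes` parameter (a stand-in for the "crucial region" heuristic), the learning rate, and the synthetic byte strings standing in for grb and non-grb files are all assumptions made for illustration.

```python
import math
from typing import List, Optional, Sequence, Tuple


def byte_histogram(data: bytes, max_bytes: Optional[int] = None) -> List[float]:
    """Return a 256-bin byte histogram, normalised so the bins sum to 1.

    max_bytes is a hypothetical knob for the "crucial region" heuristic:
    when set, only the leading max_bytes bytes are counted.
    """
    if max_bytes is not None:
        data = data[:max_bytes]
    counts = [0] * 256
    for value in data:  # iterating over bytes yields ints in 0..255
        counts[value] += 1
    total = len(data)
    if total == 0:
        return [0.0] * 256
    return [c / total for c in counts]


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(weights: Sequence[float], bias: float, x: Sequence[float]) -> float:
    """Estimated probability that histogram x belongs to the positive class."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train_logistic(
    samples: Sequence[Sequence[float]],
    labels: Sequence[int],
    epochs: int = 500,
    lr: float = 1.0,
) -> Tuple[List[float], float]:
    """Fit logistic regression with plain stochastic gradient descent."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            err = predict(weights, bias, x) - y  # gradient of log-loss w.r.t. the logit
            for i, xi in enumerate(x):
                if xi:  # empty bins contribute nothing to the update
                    weights[i] -= lr * err * xi
            bias -= lr * err
    return weights, bias


if __name__ == "__main__":
    # Synthetic stand-ins only: these are NOT real grb files.
    positives = [bytes([200, 201, 202, 250] * 50), bytes([200, 250, 251] * 40)]
    negatives = [b"hello world, plain text " * 10, b"another ascii document\n" * 12]
    samples = [byte_histogram(d) for d in positives + negatives]
    labels = [1] * len(positives) + [0] * len(negatives)
    weights, bias = train_logistic(samples, labels)
    for name, blob in [("binary-like", bytes([200, 250] * 30)),
                       ("text-like", b"plain ascii text again")]:
        print(name, round(predict(weights, bias, byte_histogram(blob)), 3))
```

Circumscribing the negative class, as recommended above, corresponds here to choosing which files go into `negatives`. A real model would need far more training files per type and a held-out set for measuring accuracy.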
