2009/3/20 pranav <pra...@gmail.com>:
> Greetings All,
>
> I have huge number of HTML files, all in english. I also have their
> counterpart files in Spanish. The non english files have their look
> and feel a little different than their english counterpart.
>
> My task is to make sure that the English HTML files contain the
> Spanish text, with retaining the English look and feel.
> ...
>
> Pranny
> --
> http://mail.python.org/mailman/listinfo/python-list
>
Hi,

I guess this task probably cannot be solved fully automatically unless the HTML has some exact, regular structure, which doesn't seem likely.

If you would prefer to work with static sources, you can try to identify the differences in the markup of the English and Spanish pages, e.g. using BeautifulSoup
http://www.crummy.com/software/BeautifulSoup/
or at least approximately with regular expressions, e.g.:

    import re
    tags_only_source = re.findall(r"<[^>]+>", html_source)

This should return the list of tags for simple code (neglecting nesting, commented-out code, strings containing tag-like text ...).

The difflib library could then help in identifying the differences in the markup, cf.:
http://docs.python.org/library/difflib.html

    >>> import difflib
    >>> for difference in difflib.ndiff("abcadefsdf", "abQcadsdfAA"):
    ...     print(difference, end=" ")
    ...
      a   b + Q   c   a   d - e - f   s   d   f + A + A

(The sample strings used here as arguments for ndiff can also be the lists of strings returned by findall() above.)

If you are lucky and the differences are rather small and regular, you can then try to modify the markup in the Spanish pages to be more similar to the English ones, again possibly using BeautifulSoup or even re.sub(...) (of course, saving the modified sources as new files in some other directory).

(The opposite, taking the English markup and filling it with the Spanish text, would be trickier, I guess.)

However, all that is likely to help with only part of the task, which will almost certainly require more or less "manual" work.

Someone more experienced can probably propose a more effective approach...

hth,
vbr
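
P.S. A minimal sketch of how the findall() and ndiff() steps above might be combined, in case it helps. The function names extract_tags and markup_differences and the two sample pages are made up for illustration only. The regular expression is the same naive one as above, so it ignores comments, scripts and attribute values containing ">".

```python
import difflib
import re

# Naive tag matcher, as in the message above: anything between "<" and ">".
TAG_PATTERN = re.compile(r"<[^>]+>")


def extract_tags(html_source):
    """Return the tags found in html_source, in document order."""
    return TAG_PATTERN.findall(html_source)


def markup_differences(english_html, spanish_html):
    """Return only the ndiff lines marking tags unique to one of the pages.

    Lines starting with "- " are tags present only in the English page,
    lines starting with "+ " are tags present only in the Spanish page.
    """
    english_tags = extract_tags(english_html)
    spanish_tags = extract_tags(spanish_html)
    return [
        line
        for line in difflib.ndiff(english_tags, spanish_tags)
        if line.startswith(("- ", "+ "))
    ]


# Hypothetical sample pages, standing in for one English/Spanish file pair.
english_page = '<html><body><div class="main"><p>Hello</p></div></body></html>'
spanish_page = '<html><body><p><b>Hola</b></p></body></html>'

for line in markup_differences(english_page, spanish_page):
    print(line)
```

Running this over each file pair should at least show how regular the markup differences are before deciding whether re.sub() or BeautifulSoup is worth trying.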