I can successfully retrieve the page source from a segment with this command (I don't know how to include all the segment folders at once, so I'd appreciate it if you could tell me): bin/nutch readseg -dump crawl_folder/segments/segment_folder_name/ extract_folder_name -nofetch -nogenerate -noparse -noparsedata -noparsetext
This returns the page source, but it still includes HTML tags like <div id="facebox" style="display:none;">, <div class="popup">, <table>, <tbody>, <tr>. I want only the text, for language detection. How can I do that? Thank you.
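
To make the goal concrete, here is a minimal sketch of the kind of tag stripping I mean, done as a post-processing step on the dumped HTML. It uses only Python's standard html.parser module and is not a Nutch feature; the names TextExtractor and html_to_text are hypothetical, chosen just for this illustration. It assumes the HTML of a single page has already been pulled out of the dump as a string.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and skips the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []      # text fragments in document order
        self._skip_depth = 0   # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self._chunks.append(text)

    def text(self):
        return " ".join(self._chunks)


def html_to_text(html):
    """Return the text of an HTML string with all tags removed (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    sample = (
        '<div id="facebox" style="display:none;"><div class="popup">'
        "<table><tbody><tr><td>Hello world</td></tr></tbody></table>"
        "</div></div>"
    )
    print(html_to_text(sample))
```

The resulting plain string could then be handed to whatever language detector is used. Whether this is better than something Nutch already provides is exactly what I am unsure about.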