Hi,

The crawls themselves are defined and run by the organisation behind http://commoncrawl.org/ (see also http://commoncrawl.org/connect/contact-us/); we only consume the resulting, freely available data.

Dominik.

On Tue, Oct 9, 2018 at 7:35 AM Dave Fisher <[email protected]> wrote:

> I’m following this as part of my talk at COSCON which I plan to include
> common crawler.
>
> Who is in charge of where the crawler is pointed and how would one ask for
> additional URLs?
>
> Regards,
> Dave
>
> Sent from my iPhone
>
> > On Oct 8, 2018, at 10:31 PM, Dominik Stadler <[email protected]> wrote:
> >
> > Hi Andi,
> >
> > I have now executed the CommonCrawlDownload-tool on crawl 2018-30, only 144
> > files did match by extension, I have collected them at
> > https://www.dropbox.com/s/w3sxnb5l3er3kdq/downloadEMF.zip?dl=0 however many
> > are actually some HTML, mostly redirects.
> >
> > 5hwaterwiki2011.wikispaces.com_file_links_parana_river_wordart.emf: empty
> > 5hwaterwiki2011.wikispaces.com_file_view_parana_river_wordart.emf: Windows Enhanced Metafile (EMF) image data version 0x10000
> > apache.org_foundation_press_kit_asf_logo_wide.emf: Windows Enhanced Metafile (EMF) image data version 0x10000
> > apkpure.com_emf-fitness_com.technogym.emf: HTML document, UTF-8 Unicode text, with very long lines, with CRLF, LF line terminators
> > apkpure.com_eye-monster-invasion-free_com.abula.emf: HTML document, UTF-8 Unicode text, with very long lines, with CRLF, LF line terminators
> > appraiser77.ru_adds_nechaev_rakova_2.files_image001.emz: gzip compressed data, max compression, from NTFS filesystem (NT)
> > atlantabusinessnetwork.org_newsletter_july_image014.emz: ASCII text, with no line terminators
> > atlantabusinessnetwork.org_newsletter_july_image015.emz: ASCII text, with no line terminators
> > caicedo.wikispaces.com_file_history_imagen1.emf: empty
> > caicedo.wikispaces.com_file_links_imagen1.emf: empty
> > chisinau.md_public_files_primaria_info_utila_rezerva_cmc.emf: HTML document, ASCII text
> > demaret.se_demaret060725.emf: HTML document, ASCII text
> > demaret.se_demaret5.emf: HTML document, ASCII text
> > downtowntactical.com_brand.emf: empty
> > encyclopedia2.thefreedictionary.com_.emf: HTML document, UTF-8 Unicode text, with very long lines, with CRLF line terminators
> > extension.sophia-it.com_content_.emf: HTML document, UTF-8 Unicode text, with very long lines, with CRLF, LF line terminators
> > faculty.ksu.edu.sa_a-alkathiri_publishingimages_link.emf: empty
> > festivales.wikispaces.com_file_links_dia_de_los_madres.emf: empty
> > informationforsurvey.com_powerprocessplant_image002.emz: ASCII text, with no line terminators
> > iranapps.ir_app_com.superphunlabs.emf: HTML document, UTF-8 Unicode text, with very long lines, with CRLF, LF line terminators
> > irunguns.com_brand.emf: empty
> > itec-int.co.jp_isop_users_images_giziroku.emf: Windows Enhanced Metafile (EMF) image data version 0x10000
> > itec-int.co.jp_isop_users_images_isopzirei.emf: Windows Enhanced Metafile (EMF) image data version 0x10000
> > itec-int.co.jp_isop_users_images_nipou.emf: Windows Enhanced Metafile (EMF) image data version 0x10000
> > itec-int.co.jp_isop_users_images_syuuhouzirei.emf: Windows Enhanced Metafile (EMF) image data version 0x10000
> > itec-int.co.jp_isop_users_images_toukou.emf: Windows Enhanced Metafile (EMF) image data version 0x10000
> > javalibs.com_artifact_org.eclipse.incquery_org.eclipse.incquery.patternlanguage.emf: HTML document, UTF-8 Unicode text, with very long lines, with CRLF, LF line terminators
> > javalibs.com_artifact_org.eclipse_org.eclipse.wst.common.emf: HTML document, UTF-8 Unicode text, with very long lines, with CRLF, LF line terminators
> > javalibs.com_artifact_org.eclipse.xpand_org.eclipse.xtend.typesystem.emf: HTML document, UTF-8 Unicode text, with very long lines, with CRLF, LF line terminators
> > karayan.net_sourin_furusato1.files_image023.emz: gzip compressed data, max compression, from NTFS filesystem (NT)
> > kentuckypawninc.com_brand.emf: empty
> > llmotivation.wikispaces.com_file_links_picture1.emf: empty
> > media.community.dell.com_zh_images_1000.4.2512.itroom.emf: HTML document, ASCII text, with CRLF line terminators
> > media.community.dell.com_zh_images_1000.4.2513.itroom.emf: HTML document, ASCII text, with CRLF line terminators
> > media.community.dell.com_zh_images_1000.4.8165.step2.emf: HTML document, ASCII text, with CRLF line terminators
> > media.community.dell.com_zh_images_1000.4.8166.step3.emf: HTML document, ASCII text, with CRLF line terminators
> > media.community.dell.com_zh_images_1000.4.8168.step2.emf: HTML document, ASCII text, with CRLF line terminators
> > media.community.dell.com_zh_images_1000.5.2515.itroom.emf: HTML document, ASCII text, with CRLF line terminators
> > mineralesygemas.com_index_archivos_image163.emz: HTML document, ASCII text
> > mvnrepository.com_artifact_org.eclipse.emf: HTML document, UTF-8 Unicode text, with very long lines
> > nienaltowski.net_drzewo_20r.nienaltowski.emf: Windows Enhanced Metafile (EMF) image data version 0x10000
> > nsw.hia.com.au_images_last_20chance_20for_20tickets.emf: HTML document, ASCII text, with CRLF line terminators
> > play.google.com_store_apps_details_id=switches.emf: HTML document, ASCII text, with very long lines
> > prstv.ru_logo_prstv_logo_01.emf: Windows Enhanced Metafile (EMF) image data version 0x10000
> > rightswebquest.wikispaces.com_file_history_picture1.emf: empty
> > rightswebquest.wikispaces.com_file_links_picture1.emf: empty
> > saf.bio.caltech.edu_ppt_g_p_i_p2i_rotated_images_ppt.emf: XML 1.0 document, ASCII text
> > school22.irkutsk.ru_gimn.emf: Windows Enhanced Metafile (EMF) image data version 0x10000
> > sipoc.wikispaces.com_file_history_mujer_joven-vieja.emf: empty
> > sipoc.wikispaces.com_file_links_mujer_joven-vieja.emf: empty
> > sipoc.wikispaces.com_file_view_mujer_joven-vieja.emf: Windows Enhanced Metafile (EMF) image data version 0x10000
> > stillbeziehungen.tk_whitedrop_ammentriskele.emf: HTML document, ASCII text
> > stillbeziehungen.tk_whitedrop_erolaclo2.emf: HTML document, ASCII text
> > stillbeziehungen.tk_whitedrop_erolaclo.emf: HTML document, ASCII text
> > sureshotguns.com_brand.emf: empty
> > thcs-qcong.quangdien.thuathienhue.edu.vn_imgs_thu_muc_he_thong__nam_2013_picture1.emf: HTML document, ASCII text
> > webgerman.com_presentations_animationresearchreport_files_slide0009_image033.emz: HTML document, ASCII text, with no line terminators
> > wikitext.transvivid.ch_iframes_image001.emz: HTML document, ASCII text, with CRLF, LF line terminators
> > working-memory-and-education.wikispaces.com_file_view_ld_and_wm_chart.emf: HTML document, ASCII text, with very long lines
> > www.aibi.ph_htm_oldharvest_leaven-like-evangelism_files_image007.emz: HTML document, ASCII text
> > www.aibi.ph_htm_oldharvest_leaven-like-evangelism_files_image009.emz: HTML document, ASCII text
> > www.appbrain.com_app_emf-sensor_com.codebros.emf: HTML document, UTF-8 Unicode text, with very long lines
> > www.chisinau.md_public_files_primaria_info_utila_rezerva_cmc.emf: HTML document, ASCII text
> > www.drugfuture.com_chemdata_stremf_iminodisuccinic-acid.emf: HTML document, ISO-8859 text, with CRLF line terminators
> > www.eclipse.org_projects_project-plan.php_projectid=modeling.emf: HTML document, ASCII text, with very long lines
> > www.extremedemocracy.com_information_20_26_20values.emf: Windows Enhanced Metafile (EMF) image data version 0x10000
> > www.goldsim.com_downloads_library_images_logos_symbol_fullname_blackgold.emf: Windows Enhanced Metafile (EMF) image data version 0x10000
> > www.goldsim.com_downloads_library_images_logos_symbol_name_blackgold.emf: Windows Enhanced Metafile (EMF) image data version 0x10000
> > www.goldsim.com_downloads_library_images_logos_symbol_whitegold.emf: Windows Enhanced Metafile (EMF) image data version 0x10000
> > www.keywild.com_six_mountains_rullers_ruller_12_inches.emf: HTML document, ASCII text, with no line terminators
> > www.keywild.com_six_mountains_rullers_ruller_6_inches.emf: HTML document, ASCII text, with no line terminators
> > www.otsucci.or.jp_public_keikyo_keikyo31_image005.emz: HTML document, ASCII text
> > www.otsucci.or.jp_public_keikyo_keikyo31_image007.emz: HTML document, ASCII text
> > www.otsucci.or.jp_public_keikyo_keikyo32_image011.emz: HTML document, ASCII text
> > www.otsucci.or.jp_public_keikyo_keikyo32_image019.emz: HTML document, ASCII text
> > www.otsucci.or.jp_public_keikyo_keikyo34_image005.emz: HTML document, ASCII text
> > www.otsucci.or.jp_public_keikyo_keikyo34_image009.emz: HTML document, ASCII text
> > www.otsucci.or.jp_public_keikyo_keikyo35_image003.emz: HTML document, ASCII text
> > www.otsucci.or.jp_public_keikyo_keikyo35_image011.emz: HTML document, ASCII text
> > www.otsucci.or.jp_public_keikyo_keikyo35_image015.emz: HTML document, ASCII text
> > www.rogerblench.info_language_afroasiatic_aaop_files_image003.emz: gzip compressed data, max compression, from NTFS filesystem (NT)
> > www.tulaed-union.ru_111_chislenniy_20sostav.emf: Windows Enhanced Metafile (EMF) image data version 0x10000
> > www.wiki.ciu20.org_file_history_background4.emf: empty
> > www.wiki.ciu20.org_file_links_background4.emf: empty
> > zakon4.rada.gov.ua_laws_file_imgs_21_p416467n73-2.emf: Windows Enhanced Metafile (EMF) image data version 0x10000
> > zakon4.rada.gov.ua_laws_file_imgs_24_p416467n103-17.emf: Windows Enhanced Metafile (EMF) image data version 0x10000
> > zakon4.rada.gov.ua_laws_file_imgs_24_p416467n110-19.emf: Windows Enhanced Metafile (EMF) image data version 0x10000
> > zakon4.rada.gov.ua_laws_file_imgs_24_p416467n112-20.emf: Windows Enhanced Metafile (EMF) image data version 0x10000
> > zakon4.rada.gov.ua_laws_file_imgs_24_p416467n113-21.emf: Windows Enhanced Metafile (EMF) image data version 0x10000
> > zakon4.rada.gov.ua_laws_file_imgs_24_p416467n122-27.emf: Windows Enhanced Metafile (EMF) image data version 0x10000
> > zakon4.rada.gov.ua_laws_file_imgs_24_p416467n88-8.emf: Windows Enhanced Metafile (EMF) image data version 0x10000
> > zakon4.rada.gov.ua_laws_file_imgs_24_p416467n93-13.emf: Windows Enhanced Metafile (EMF) image data version 0x10000
> > zakon4.rada.gov.ua_laws_file_imgs_24_p416467n95-14.emf: Windows Enhanced Metafile (EMF) image data version 0x10000
> > zavodsvet.ru_production_files_but__d0_92-28-1-500-_d0_93_d0_a0.emf: HTML document, ISO-8859 text, with CRLF line terminators
> > zavodsvet.ru_production_files_but__d0_92-28-2-500-_d0_a4_d0_af.emf: HTML document, ISO-8859 text, with CRLF line terminators
> > zavodsvet.ru_production_files_but__d0_92-28-7-250-_d0_9e_d0_9a.emf: HTML document, ISO-8859 text, with CRLF line terminators
> > zavodsvet.ru_production_files_but__d0_92-28-7-250-_d0_a4_d0_96.emf: HTML document, ISO-8859 text, with CRLF line terminators
> > zavodsvet.ru_production_files_but__d0_92-28-7-375-_d0_a0_d0_94.emf: HTML document, ISO-8859 text, with CRLF line terminators
> > zavodsvet.ru_production_files_but__d0_92-31-4-500-_d0_a2_d0_a3.emf: HTML document, ISO-8859 text, with CRLF line terminators
> > zavodsvet.ru_production_files_but__d0_92-31-4-700-_d0_a2_d0_a3.emf: HTML document, ISO-8859 text, with CRLF line terminators
> > zavodsvet.ru_production_files_but__d0_92-5-500-_d0_9b_d0_92.emf: HTML document, ISO-8859 text, with CRLF line terminators
> > zavodsvet.ru_production_files_but__d0_92_d0_9a_d0_9f-330-_d0_a1.emf: HTML document, ISO-8859 text, with CRLF line terminators
> > zavodsvet.ru_production_files_but__d0_92_d0_bd-28-8-1000-_d0_a1_d0_a0.emf: HTML document, ISO-8859 text, with CRLF line terminators
> > zavodsvet.ru_production_files_but__d0_92_d0_bd-28-8-500-_d0_a1_d0_a0.emf: HTML document, ISO-8859 text, with CRLF line terminators
> > zavodsvet.ru_production_files_but__d0_9a_d0_9f_d0_94-1-500-_d0_91_d0_9a.emf: HTML document, ISO-8859 text, with CRLF line terminators
> > zavodsvet.ru_production_files_but__d0_9a_d0_9f_d0_9c-30-1000-_d0_91_d0_95.emf: HTML document, ISO-8859 text, with CRLF line terminators
> > zavodsvet.ru_production_files_but__d0_9a_d0_9f_d0_9c-30-1000-_d0_9f_d0_92.emf: HTML document, ISO-8859 text, with CRLF line terminators
> > zavodsvet.ru_production_files_but__d0_9a_d0_9f_d0_9c-30-500-_d0_91_d0_95.emf: HTML document, ISO-8859 text, with CRLF line terminators
> > zavodsvet.ru_production_files_but__d0_9a_d0_9f_d0_9c-30-500-_d0_91_d0_ae.emf: HTML document, ISO-8859 text, with CRLF line terminators
> > zavodsvet.ru_production_files_but__d0_9a_d0_9f_d0_9c-30-500-_d0_9f_d0_92.emf: HTML document, ISO-8859 text, with CRLF line terminators
> > zavodsvet.ru_production_files_but__d0_9a_d0_9f_d0_9c-30-750-_d0_91_d0_ae.emf: HTML document, ISO-8859 text, with CRLF line terminators
> > zavodsvet.ru_production_files_but__d0_9a_d0_9f_d0_9c-30-750-_d0_9f_d0_92.emf: HTML document, ISO-8859 text, with CRLF line terminators
> > zavodsvet.ru_production_files_but__d0_9f-8-500-_d0_a4_d0_9d.emf: HTML document, ISO-8859 text, with CRLF line terminators
> > zavodsvet.ru_production_files_but_kpd-1-500-bk.emf: Windows Enhanced Metafile (EMF) image data version 0x10000
> > zavodsvet.ru_production_files_but_kpm-30-1000-be.emf: Windows Enhanced Metafile (EMF) image data version 0x10000
> > zavodsvet.ru_production_files_but_kpm-30-1000-pv.emf: Windows Enhanced Metafile (EMF) image data version 0x10000
> > zavodsvet.ru_production_files_but_kpm-30-500-be.emf: Windows Enhanced Metafile (EMF) image data version 0x10000
> > zavodsvet.ru_production_files_but_kpm-30-500-byu.emf: Windows Enhanced Metafile (EMF) image data version 0x10000
> > zavodsvet.ru_production_files_but_kpm-30-500-pv.emf: Windows Enhanced Metafile (EMF) image data version 0x10000
> > zavodsvet.ru_production_files_but_kpm-30-750-byu.emf: Windows Enhanced Metafile (EMF) image data version 0x10000
> > zavodsvet.ru_production_files_but_kpm-30-750-pv.emf: Windows Enhanced Metafile (EMF) image data version 0x10000
> > zavodsvet.ru_production_files_but_p-8-500-fn.emf: Windows Enhanced Metafile (EMF) image data version 0x10000
> > zavodsvet.ru_production_files_but_v-28-1-500-gr.emf: Windows Enhanced Metafile (EMF) image data version 0x10000
> > zavodsvet.ru_production_files_but_v-28-2-500-fya.emf: Windows Enhanced Metafile (EMF) image data version 0x10000
> > zavodsvet.ru_production_files_but_v-28-7-250-fzh.emf: Windows Enhanced Metafile (EMF) image data version 0x10000
> > zavodsvet.ru_production_files_but_v-28-7-250-ok.emf: Windows Enhanced Metafile (EMF) image data version 0x10000
> > zavodsvet.ru_production_files_but_v-28-7-375-rd.emf: Windows Enhanced Metafile (EMF) image data version 0x10000
> > zavodsvet.ru_production_files_but_v-31-4-500-tu.emf: Windows Enhanced Metafile (EMF) image data version 0x10000
> > zavodsvet.ru_production_files_but_v-31-4-700-tu.emf: Windows Enhanced Metafile (EMF) image data version 0x10000
> > zavodsvet.ru_production_files_but_v-5-500-lv.emf: Windows Enhanced Metafile (EMF) image data version 0x10000
> > zavodsvet.ru_production_files_but_vkp-330-s.emf: Windows Enhanced Metafile (EMF) image data version 0x10000
> > zavodsvet.ru_production_files_but_vn-28-8-1000-sr.emf: Windows Enhanced Metafile (EMF) image data version 0x10000
> > zavodsvet.ru_production_files_but_vn-28-8-500-sr.emf: Windows Enhanced Metafile (EMF) image data version 0x10000
> > zavodsvet.ru_production_files_but_xxi-_d0_92-28-7-375.emf: HTML document, ISO-8859 text, with CRLF line terminators
> > zavodsvet.ru_production_files_but_xxi-_d0_92-28-7-500.emf: HTML document, ISO-8859 text, with CRLF line terminators
> > zavodsvet.ru_production_files_but_xxi-v-28-7-375.emf: Windows Enhanced Metafile (EMF) image data version 0x10000
> > zavodsvet.ru_production_files_but_xxi-v-28-7-500.emf: Windows Enhanced Metafile (EMF) image data version 0x10000
> >
> > Dominik.
> >
> >> On Mon, Oct 8, 2018 at 1:37 PM Tim Allison <[email protected]> wrote:
> >>
> >> At some point I extracted all emfs from our corpus. I’ll see if that data
> >> is still around and/or re-extract...prob have time tomorrow/ Wednesday
> >>
> >> On Sun, Oct 7, 2018 at 5:01 PM Dominik Stadler <[email protected]> wrote:
> >>
> >>> Hi Andi
> >>>
> >>> It is easy to change CommonCrawlDocumentDownload to fetch other mime-types,
> >>> see
> >>> https://github.com/centic9/CommonCrawlDocumentDownload/tree/download_emf
> >>>
> >>> However .emf files don't appear in the top-100 mimetypes of the crawls and
> >>> thus are likely very rarely included if at all. I started a download-run,
> >>> but the first two of the 300 index-files do not contain any matching
> >>> extension or mime-type.
> >>>
> >>> See https://commoncrawl.github.io/cc-crawl-statistics/plots/mimetypes for
> >>> mimetype-statistics in the crawl.
> >>>
> >>> Dominik.
> >>>
> >>> On Sat, Oct 6, 2018 at 8:14 PM Andreas Beeker <[email protected]> wrote:
> >>>
> >>>> Hi Tim / Dominik,
> >>>>
> >>>> please give me a few pointers, how I could access a pool of EMF files,
> >>>> e.g. (not only) within the common crawl corpus. My focus is currently on
> >>>> rendering, but as I extend the supported records, I also like to validate
> >>>> the parsing.
> >>>> As the EMF parsing is relatively new, you still might have a corpus for
> >>>> it, Tim?
> >>>>
> >>>> I have a few old mails about the common crawl corpus [2], but I guess
> >>>> there has been some restructuring taken place and there might be an easier
> >>>> option than downloading the whole index.
> >>>>
> >>>> Of course office files which I parse for embedded EMFs are also ok.
> >>>>
> >>>> I have to admit, that I haven't yet tested Dominiks tool [1].
> >>>>
> >>>> Alternatively I can use the govdocs1 corpus [3]
> >>>>
> >>>> Best wishes,
> >>>> Andi
> >>>>
> >>>> [1] https://github.com/centic9/CommonCrawlDocumentDownload
> >>>> [2] http://apache-poi.1045710.n5.nabble.com/Using-CommonCrawl-for-POI-regression-mass-testing-td5721585.html
> >>>> [3] http://downloads.digitalcorpora.org/corpora/files/govdocs1/by_type/
> >>>>
> >>>> ---------------------------------------------------------------------
> >>>> To unsubscribe, e-mail: [email protected]
> >>>> For additional commands, e-mail: [email protected]
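
The quoted listing shows that matching on the .emf/.emz extension alone pulls in many HTML pages and empty files. As a rough illustration of how a downloaded directory could be pre-filtered before feeding it to a parser, here is a minimal Python sketch. It is not part of CommonCrawlDocumentDownload; the function names (`is_emf_header`, `classify`) and the category labels are made up for this example. It assumes that an EMF file starts with a header record of type 1 and carries the signature bytes " EMF" at offset 40, and that an .emz file is a gzip-compressed EMF.

```python
import gzip
import struct
import sys
import zlib
from pathlib import Path

EMF_SIGNATURE = 0x464D4520  # the bytes " EMF", read as a little-endian 32-bit value
GZIP_MAGIC = b"\x1f\x8b"    # first two bytes of any gzip stream


def is_emf_header(data: bytes) -> bool:
    """Return True if data begins with an EMF header record (type 1, ' EMF' at offset 40)."""
    if len(data) < 44:
        return False
    record_type = struct.unpack_from("<I", data, 0)[0]
    signature = struct.unpack_from("<I", data, 40)[0]
    return record_type == 1 and signature == EMF_SIGNATURE


def classify(data: bytes) -> str:
    """Sort file content into coarse buckets (labels are hypothetical, for illustration only)."""
    if not data:
        return "empty"
    if is_emf_header(data):
        return "emf"
    if data[:2] == GZIP_MAGIC:
        try:
            inner = gzip.decompress(data)
        except (OSError, EOFError, zlib.error):
            return "gzip (unreadable)"
        return "emz" if is_emf_header(inner) else "gzip (not EMF)"
    return "other"  # e.g. the HTML redirect pages seen in the listing


def main(directory: str) -> None:
    """Print one 'name: category' line per file in the given directory."""
    for path in sorted(Path(directory).iterdir()):
        if path.is_file():
            print(f"{path.name}: {classify(path.read_bytes())}")


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```

Run against the unpacked download directory (path passed as the first argument), this prints one line per file, so only the entries reported as `emf` or `emz` need to be kept for rendering or parsing tests.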
